Abstract

The Minimum Description Length principle for online sequence estima- 
tion/prediction in a proper learning setup is studied. If the underlying model 
class is discrete, then the total expected square loss is a particularly inter- 
esting performance measure: (a) this quantity is finitely bounded, implying 
convergence with probability one, and (b) it additionally specifies the con- 
vergence speed. For MDL, in general one can only have loss bounds which 
are finite but exponentially larger than those for Bayes mixtures. We show 
that this is even the case if the model class contains only Bernoulli distribu- 
tions. We derive a new upper bound on the prediction error for countable 
Bernoulli classes. This implies a small bound (comparable to the one for 
Bayes mixtures) for certain important model classes. We discuss the applica- 
tion to Machine Learning tasks such as classification and hypothesis testing, 
and generalization to countable classes of i.i.d. models. 
1 Introduction



"Bayes mixture", "Solomonoff induction", "marginalization" , all these terms refer 
to a central induction principle: Obtain a predictive distribution by integrating 
the product of prior and evidence over the model class. In many cases however, the 
Bayes mixture is computationally infeasible, and even a sophisticated approximation 
is expensive. The MDL or MAP (maximum a posteriori) estimator is both a common 
approximation for the Bayes mixture and interesting for its own sake: Use the model 
with the largest product of prior and evidence. (In practice, the MDL estimator is 
usually being approximated too, since only a local maximum is determined.) 

How good are the predictions by Bayes mixtures and MDL? This question has 
attracted much attention. In many cases, an important quality measure is the 
total or cumulative expected loss of a predictor. In particular the square loss is 
often considered. Assume that the outcome space is finite, and the model class 
is continuously parameterized. Then for Bayes mixture prediction, the cumulative 
expected square loss is usually small but unbounded, growing with $\ln n$, where $n$ is
the sample size [CB90, Hut03b]. This corresponds to an instantaneous loss bound
of $\frac{1}{n}$. For the MDL predictor, the losses behave similarly [Ris96, BRY98] under
appropriate conditions, in particular with a specific prior. (Note that in order to do 
MDL for continuous model classes, one needs to discretize the parameter space, see 
also [EM].) 



On the other hand, if the model class is discrete, then Solomonoff's theorem
[Sol78, Hut01] bounds the cumulative expected square loss for the Bayes mixture
predictions finitely, namely by $\ln w_\mu^{-1}$, where $w_\mu$ is the prior weight of the "true"
model $\mu$. The only necessary assumption is that the true distribution $\mu$ is
contained in the model class, i.e. that we are dealing with proper learning. It has been
demonstrated [GL04] that for both Bayes mixture and MDL, the proper learning
assumption can be essential: If it is violated, then learning may fail very badly.

For MDL predictions in the proper learning case, it has been shown [PH04a]
that a bound of $w_\mu^{-1}$ holds. This bound is exponentially larger than the Solomonoff
bound, and it is sharp in general. A finite bound on the total expected square loss
is particularly interesting:

1. It implies convergence of the predictive to the true probabilities with
probability one. In contrast, an instantaneous loss bound of $\frac{1}{n}$ implies only
convergence in probability.

2. Additionally, it gives a convergence speed, in the sense that errors of a certain 
magnitude cannot occur too often. 

So for both Bayes mixtures and MDL, convergence with probability one holds,
while the convergence speed is exponentially worse for MDL compared to the Bayes
mixture. (We avoid the term "convergence rate" here, since the order of convergence
is identical in both cases. It is e.g. $o(1/n)$ if we additionally assume that the error
is monotonically decreasing, which is not necessarily true in general.)
It is therefore natural to ask if there are model classes where the cumulative loss
of MDL is comparable to that of Bayes mixture predictions. In the present work, 
we concentrate on the simplest possible stochastic case, namely discrete Bernoulli 
classes. (Note that then the MDL "predictor" just becomes an estimator, in that it 
estimates the true parameter and directly uses that for prediction. Nevertheless, for 
consistency of terminology, we keep the term predictor.) It might be surprising to 
discover that in general the cumulative loss is still exponential. On the other hand, 
we will give mild conditions on the prior guaranteeing a small bound. Moreover, it is 
well-known that the instantaneous square loss of the Maximum Likelihood estimator
decays as $\frac{1}{n}$ in the Bernoulli case. The same holds for MDL, as we will see. (If
convergence speed is measured in terms of instantaneous losses, then much more
general statements are possible [Li99, Zha04]; this is briefly discussed in Section 4.)

A particular motivation to consider discrete model classes arises in Algorithmic 
Information Theory. From a computational point of view, the largest relevant model 
class is the class of all computable models on some fixed universal Turing machine, 
precisely prefix machine [LV97]. Thus each model corresponds to a program, and
there are countably many programs. Moreover, the models are stochastic, precisely 
they are semimeasures on strings (programs need not halt, otherwise the models 
were even measures). Each model has a natural description length, namely the 
length of the corresponding program. If we agree that programs are binary strings, 
then a prior is defined by two to the negative description length. By the Kraft 
inequality, the priors sum up to at most one. 

Also the Bernoulli case can be studied in the view of Algorithmic Information
Theory. We call this the universal setup: Given a universal Turing machine, the
related class of Bernoulli distributions is isomorphic to the countable set of
computable reals in $[0, 1]$. The description length $Kw(\vartheta)$ of a parameter
$\vartheta \in [0, 1]$ is then given by the length of its shortest program. A prior weight
may then be defined by $2^{-Kw(\vartheta)}$. (If a string $x = x_1 x_2 \ldots x_{t-1}$ is generated
by a Bernoulli distribution with computable parameter $\vartheta_0 \in [0, 1]$, then with high
probability the two-part complexity of $x$ with respect to the Bernoulli class does not
exceed its algorithmic complexity by more than a constant, as shown by Vovk [Vov97].
That is, the two-part complexity with respect to the Bernoulli class is the shortest
description, save for an additive constant.)

Many Machine Learning tasks are or can be reduced to sequence prediction tasks.
An important example is classification. The task of classifying a new instance $z_n$
after having seen (instance, class) pairs $(z_1, c_1), \ldots, (z_{n-1}, c_{n-1})$ can be phrased
as predicting the continuation of the sequence $z_1 c_1 \ldots z_{n-1} c_{n-1} z_n$. Typically the
(instance, class) pairs are i.i.d. Cumulative loss bounds for prediction usually generalize
to prediction conditionalized to some inputs [PH05]. Then we can solve classification
problems in the standard form. It is not obvious if and how the proofs in this paper
can be conditionalized.

Our main tool for obtaining results is the Kullback-Leibler divergence. Lemmata
for this quantity are stated in Section 2. Section 3 shows that the exponential error
bound obtained in [PH04a] is sharp in general. In Section 4, we give an upper bound
on the instantaneous and the cumulative losses. The latter bound is small e.g. under
certain conditions on the distribution of the weights; this is the subject of Section 5.
Section 6 treats the universal setup. Finally, in Section 7 we discuss the results
and give conclusions.



2 Kullback-Leibler Divergence 

Let $\mathbb{B} = \{0, 1\}$ and consider finite strings $x \in \mathbb{B}^*$ as well as infinite sequences
$x_{<\infty} \in \mathbb{B}^\infty$, with the first $n$ bits denoted by $x_{1:n}$. If we know that $x$ is generated
by an i.i.d. random variable, then $P(x_i = 1) = \vartheta_0$ for all $1 \le i \le \ell(x)$, where $\ell(x)$
is the length of $x$. Then $x$ is called a Bernoulli sequence, and $\vartheta_0 \in \Theta \subset [0, 1]$ the
true parameter. In the following we will consider only countable $\Theta$, e.g. the set of
all computable numbers in $[0, 1]$.

Associated with each $\vartheta \in \Theta$, there is a complexity or description length $Kw(\vartheta)$
and a weight or (semi)probability $w_\vartheta = 2^{-Kw(\vartheta)}$. The complexity will often but
need not be a natural number. Typically, one assumes that the weights sum up
to at most one, $\sum_{\vartheta\in\Theta} w_\vartheta \le 1$. Then, by the Kraft inequality, for all $\vartheta \in \Theta$ there
exists a prefix-code of length $Kw(\vartheta)$. Because of this correspondence, it is only a
matter of convenience whether results are developed in terms of description lengths
or probabilities. We will choose the former way. We won't even need the condition
$\sum_\vartheta w_\vartheta \le 1$ for most of the following results. This only means that $Kw$ cannot be
interpreted as a prefix code length, but does not cause other problems.

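Given a set of distributions $\Theta \subset [0, 1]$, complexities $(Kw(\vartheta))_{\vartheta\in\Theta}$, a true
distribution $\vartheta_0 \in \Theta$, and some observed string $x \in \mathbb{B}^*$, we define an MDL
estimator [footnote 1]:

\[ \hat\vartheta^x = \arg\max_{\vartheta\in\Theta} \{ w_\vartheta P(x|\vartheta) \}. \]

Here, $P(x|\vartheta)$ is the probability of observing $x$ if $\vartheta$ is the true parameter. Clearly,
$P(x|\vartheta) = \vartheta^{\mathbb{1}(x)} (1-\vartheta)^{\ell(x)-\mathbb{1}(x)}$, where $\mathbb{1}(x)$ is the number of ones in $x$. Hence
$P(x|\vartheta)$ depends only on $\ell(x)$ and $\mathbb{1}(x)$. We therefore see

\begin{align}
\hat\vartheta^x = \hat\vartheta^{(\alpha,n)} &= \arg\max_{\vartheta\in\Theta} \{ w_\vartheta (\vartheta^\alpha (1-\vartheta)^{1-\alpha})^n \} \tag{1} \\
&= \arg\min_{\vartheta\in\Theta} \{ n \cdot D(\alpha\|\vartheta) + Kw(\vartheta) \cdot \ln 2 \},
\end{align}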

where $n = \ell(x)$ and $\alpha := \frac{\mathbb{1}(x)}{\ell(x)}$ is the observed fraction of ones and

\[ D(\alpha\|\vartheta) = \alpha \ln\frac{\alpha}{\vartheta} + (1-\alpha) \ln\frac{1-\alpha}{1-\vartheta} \]

is the Kullback-Leibler divergence. The second line of (1) also explains the name
MDL, since we choose the $\vartheta$ which minimizes the joint description of model $\vartheta$ and
the data $x$ given the model.

Footnote 1: Precisely, we define a MAP (maximum a posteriori) estimator. For two reasons,
our definition might not be considered as MDL in the strict sense. First, MDL is often
associated with a specific prior, while we admit arbitrary priors. Second and more
importantly, when coding some data $x$, one can exploit the fact that once the parameter
$\hat\vartheta^x$ is specified, only data which leads to this $\hat\vartheta^x$ needs to be considered. This
allows for a description shorter than $Kw(\hat\vartheta^x)$. Nevertheless, the construction principle
is commonly termed MDL, compare e.g. the "ideal MDL" in [VL00].
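As a minimal sketch of the minimization in the second line of (1), the following Python snippet selects the MDL estimator from a finite class. The function names `kl` and `mdl_estimate`, the dictionary input format, and the two-element toy class are illustrative assumptions and do not appear in the paper.

```python
import math

def kl(a, t):
    """Binary Kullback-Leibler divergence D(a||t) in nats, with 0*ln(0) := 0."""
    total = 0.0
    if a > 0:
        total += math.inf if t == 0 else a * math.log(a / t)
    if a < 1:
        total += math.inf if t == 1 else (1 - a) * math.log((1 - a) / (1 - t))
    return total

def mdl_estimate(ones, n, complexities):
    """Return the theta minimizing n*D(alpha||theta) + Kw(theta)*ln 2.

    complexities: dict mapping each theta in the class to Kw(theta)
    (a hypothetical input format chosen for this sketch).
    """
    alpha = ones / n  # observed fraction of ones
    return min(complexities,
               key=lambda t: n * kl(alpha, t) + complexities[t] * math.log(2))

# Toy class: 1/2 is cheap to describe, 0.7 is expensive.
toy = {0.5: 1.0, 0.7: 10.0}
print(mdl_estimate(7, 10, toy))      # few samples: the simple model is kept
print(mdl_estimate(700, 1000, toy))  # many samples: the data overrule the prior
```

With the same observed fraction of ones, the complexity term dominates for small n and becomes negligible for large n, which is the behaviour described in the text.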

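We also define the extended Kullback-Leibler divergence

\[ D^\alpha(\vartheta\|\tilde\vartheta) = \alpha \ln\frac{\vartheta}{\tilde\vartheta} + (1-\alpha) \ln\frac{1-\vartheta}{1-\tilde\vartheta} = D(\alpha\|\tilde\vartheta) - D(\alpha\|\vartheta). \tag{2} \]

It is easy to see that $D^\alpha(\vartheta\|\tilde\vartheta)$ is linear in $\alpha$, $D^\vartheta(\vartheta\|\tilde\vartheta) = D(\vartheta\|\tilde\vartheta)$ and
$D^{\tilde\vartheta}(\vartheta\|\tilde\vartheta) = -D(\tilde\vartheta\|\vartheta)$, and $\frac{d}{d\alpha} D^\alpha(\vartheta\|\tilde\vartheta) > 0$ iff $\vartheta > \tilde\vartheta$. Note that
$D^\alpha(\vartheta\|\tilde\vartheta)$ may be also defined for the general i.i.d. case, i.e. if the alphabet has
more than two symbols.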

Let $\vartheta, \tilde\vartheta \in \Theta$ be two parameters, then it follows from (1) that in the process of
choosing the MDL estimator, $\vartheta$ is being preferred to $\tilde\vartheta$ iff

\[ n D^\alpha(\vartheta\|\tilde\vartheta) \ge \ln 2 \cdot (Kw(\vartheta) - Kw(\tilde\vartheta)) \tag{3} \]

with $n$ and $\alpha$ as before. We also say that then $\vartheta$ beats $\tilde\vartheta$. It is immediate that for
increasing $n$ the influence of the complexities on the selection of the maximizing
element decreases. We are now interested in the total expected square prediction
error (or cumulative square loss) of the MDL estimator,

\[ \sum_{n=1}^\infty \mathbf{E}(\hat\vartheta^{x_{1:n}} - \vartheta_0)^2. \]

In terms of [PH04a], this is the static MDL prediction loss, which means that a
predictor/estimator $\hat\vartheta^x$ is chosen according to the current observation $x$. (As already
mentioned, the terms predictor and estimator coincide for static MDL and Bernoulli
classes.) The dynamic method on the other hand would consider both possible
continuations $x0$ and $x1$ and predict according to $\hat\vartheta^{x0}$ and $\hat\vartheta^{x1}$. In the following,
we concentrate on static predictions. They are also preferred in practice, since
computing only one model is more efficient.

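Let $A_n = \{\frac{k}{n} : 0 \le k \le n\}$. Given the true parameter $\vartheta_0$ and some $n \in \mathbb{N}$, the
expectation of a function $f^{(n)} : \{0, \ldots, n\} \to \mathbb{R}$ is given by

\[ \mathbf{E} f^{(n)} = \sum_{\alpha\in A_n} p(\alpha|n) f^{(n)}(\alpha n), \quad\text{where}\quad p(\alpha|n) = \binom{n}{\alpha n} \left(\vartheta_0^\alpha (1-\vartheta_0)^{1-\alpha}\right)^n. \tag{4} \]

(Note that the probability $p(\alpha|n)$ depends on $\vartheta_0$, which we do not make explicit in
our notation.) Therefore,

\[ \sum_{n=1}^\infty \mathbf{E}(\hat\vartheta^{x_{1:n}} - \vartheta_0)^2 = \sum_{n=1}^\infty \sum_{\alpha\in A_n} p(\alpha|n) (\hat\vartheta^{(\alpha,n)} - \vartheta_0)^2. \tag{5} \]

Denote the relation $f = O(g)$ by $f \stackrel{\times}{\le} g$. Analogously define "$\stackrel{\times}{\ge}$" and "$\stackrel{\times}{=}$". From
[PH04a, Corollary 12], we immediately obtain the following result.

Theorem 1 The cumulative loss bound $\sum_n \mathbf{E}(\hat\vartheta^{x_{1:n}} - \vartheta_0)^2 \stackrel{\times}{\le} 2^{Kw(\vartheta_0)}$ holds.

This is the "slow" convergence result mentioned in the introduction. In contrast,
for a Bayes mixture, the total expected error is bounded by $Kw(\vartheta_0)$ rather
than $2^{Kw(\vartheta_0)}$ (see [Sol78] or [Hut01, Th.1]). An upper bound on $\sum_n \mathbf{E}(\hat\vartheta^{x_{1:n}} - \vartheta_0)^2$
is termed as convergence in mean sum and implies convergence $\hat\vartheta^{x_{1:n}} \to \vartheta_0$ with
probability 1 (since otherwise the sum would be infinite).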

We now establish relations between the Kullback-Leibler divergence and the 
quadratic distance. We call bounds of this type entropy inequalities. 

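Lemma 2 Let $\vartheta, \tilde\vartheta \in (0, 1)$. Let $\vartheta^* = \arg\min\{|\vartheta - \frac12|, |\tilde\vartheta - \frac12|\}$, i.e. $\vartheta^*$ is the element
from $\{\vartheta, \tilde\vartheta\}$ which is closer to $\frac12$. Then the following assertions hold.

(i) $D(\vartheta\|\tilde\vartheta) \ge 2 \cdot (\vartheta - \tilde\vartheta)^2$ for all $\vartheta, \tilde\vartheta \in (0, 1)$,

(ii) $D(\vartheta\|\tilde\vartheta) \le \frac83 (\vartheta - \tilde\vartheta)^2$ if $\vartheta, \tilde\vartheta \in [\frac14, \frac34]$,

(iii) $D(\vartheta\|\tilde\vartheta) \ge \frac{(\vartheta - \tilde\vartheta)^2}{2\vartheta^*(1-\vartheta^*)}$ if $\vartheta, \tilde\vartheta \le \frac12$,

(iv) $D(\vartheta\|\tilde\vartheta) \le \frac{3(\vartheta - \tilde\vartheta)^2}{2\vartheta^*(1-\vartheta^*)}$ if $\vartheta \le \frac14$ and $\tilde\vartheta \in [\frac{\vartheta}{3}, 3\vartheta]$,

(v) $D(\tilde\vartheta\|\vartheta) \ge \tilde\vartheta (\ln\tilde\vartheta - \ln\vartheta - 1)$ for all $\vartheta, \tilde\vartheta \in (0, 1)$,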

vi) D(^l^) < ±tf i/tf< < §, 

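(vii) $D(\vartheta\|\vartheta \cdot 2^{-j}) \le j \cdot \vartheta$ if $\vartheta \le \frac12$ and $j \ge 1$,

(viii) $D(\vartheta\|1 - 2^{-j}) \le j$ if $\vartheta \le \frac12$ and $j \ge 1$.

Statements (iii) to (viii) have symmetric counterparts for $\vartheta \ge \frac12$.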

The first two statements give upper and lower bounds for the Kullback-Leibler
divergence in terms of the quadratic distance. They express the fact that the
Kullback-Leibler divergence is locally quadratic. So do the next two statements; they
will be applied in particular if $\vartheta$ is located close to the boundary of $[0, 1]$. Statements
(v) and (vi) give bounds in terms of the absolute distance, i.e. "linear" instead of
quadratic. They are mainly used if $\tilde\vartheta$ is relatively far from $\vartheta$. Note that in (v), the
positions of $\vartheta$ and $\tilde\vartheta$ are inverted. The last two inequalities finally describe the behavior
of the Kullback-Leibler divergence as its second argument tends to the boundary of $[0, 1]$.
Observe that this is logarithmic in the inverse distance to the boundary.
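The quadratic entropy inequalities can be sanity-checked numerically. The sketch below tests assertions (i), (ii) and (iii) of Lemma 2 on a uniform grid; the grid resolution, the tolerance, and the function names are assumptions made for this illustration, and a grid check is of course no substitute for the proof.

```python
import math

def kl(a, t):
    """Binary Kullback-Leibler divergence D(a||t) for a, t in (0, 1)."""
    return a * math.log(a / t) + (1 - a) * math.log((1 - a) / (1 - t))

def check_entropy_inequalities(steps=200, tol=1e-12):
    """Check Lemma 2 (i), (ii), (iii) for all grid pairs in (0, 1)."""
    grid = [i / steps for i in range(1, steps)]
    for t in grid:
        for s in grid:
            d, sq = kl(t, s), (t - s) ** 2
            # (i): D >= 2 (t - s)^2 everywhere
            if d < 2 * sq - tol:
                return False
            # (ii): D <= (8/3) (t - s)^2 if both lie in [1/4, 3/4]
            if 0.25 <= t <= 0.75 and 0.25 <= s <= 0.75 and d > 8 / 3 * sq + tol:
                return False
            # (iii): D >= (t - s)^2 / (2 t* (1 - t*)) if both are <= 1/2,
            # where t* is the argument closer to 1/2, i.e. the larger one
            if t <= 0.5 and s <= 0.5:
                star = max(t, s)
                if d < sq / (2 * star * (1 - star)) - tol:
                    return False
    return True

print(check_entropy_inequalities())
```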

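Proof. (i) This is standard, see e.g. [LV97]. It is shown similarly as (iii).

(ii) Let $f(\eta) = D(\vartheta\|\eta) - \frac83 (\eta - \vartheta)^2$, then we show $f(\eta) \le 0$ for $\eta \in [\frac14, \frac34]$. We
have that $f(\vartheta) = 0$ and

\[ f'(\eta) = \frac{\eta - \vartheta}{\eta(1-\eta)} - \frac{16}{3}(\eta - \vartheta). \]

This difference is nonnegative if and only if $\eta - \vartheta \le 0$, since $\eta(1-\eta) \ge \frac{3}{16}$. This implies
$f(\eta) \le 0$.

(iii) Consider the function

\[ f(\eta) = D(\vartheta\|\eta) - \frac{(\vartheta - \eta)^2}{2\max\{\vartheta, \eta\}(1 - \max\{\vartheta, \eta\})}. \]

We have to show that $f(\eta) \ge 0$ for all $\eta \in (0, \frac12]$. It is obvious that $f(\vartheta) = 0$. For
$\eta \le \vartheta$,

\[ f'(\eta) = \frac{\eta - \vartheta}{\eta(1-\eta)} - \frac{\eta - \vartheta}{\vartheta(1-\vartheta)} \le 0 \]

holds since $\eta - \vartheta \le 0$ and $\vartheta(1-\vartheta) \ge \eta(1-\eta)$. Thus, $f(\eta) \ge 0$ must be valid for
$\eta \le \vartheta$. On the other hand if $\eta \ge \vartheta$, then

\[ f'(\eta) = \frac{\eta - \vartheta}{\eta(1-\eta)} - \left[ \frac{\eta - \vartheta}{\eta(1-\eta)} - \frac{(\eta - \vartheta)^2 (1 - 2\eta)}{2\eta^2(1-\eta)^2} \right] \ge 0 \]

is true. Thus $f(\eta) \ge 0$ holds in this case, too.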
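(iv) We show that

\[ f(\eta) = D(\vartheta\|\eta) - \frac{3(\vartheta - \eta)^2}{2\max\{\vartheta, \eta\}(1 - \max\{\vartheta, \eta\})} \le 0 \]

for $\eta \in [\frac{\vartheta}{3}, 3\vartheta]$. If $\eta \le \vartheta$, then

\[ f'(\eta) = \frac{\eta - \vartheta}{\eta(1-\eta)} - \frac{3(\eta - \vartheta)}{\vartheta(1-\vartheta)} \ge 0 \]

since $3\eta(1-\eta) \ge \vartheta(1-\eta) \ge \vartheta(1-\vartheta)$. If $\eta \ge \vartheta$, then

\[ f'(\eta) = \frac{\eta - \vartheta}{\eta(1-\eta)} - 3\left[ \frac{\eta - \vartheta}{\eta(1-\eta)} - \frac{(\eta - \vartheta)^2 (1 - 2\eta)}{2\eta^2(1-\eta)^2} \right] \le 0 \]

is equivalent to $4\eta(1-\eta) \ge 3(\eta - \vartheta)(1 - 2\eta)$, which is fulfilled if $\vartheta \le \frac14$ and $\eta \le 3\vartheta$
as an elementary computation verifies.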

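(v) Using $-\ln(1-u) \le \frac{u}{1-u}$, one obtains

\begin{align}
D(\tilde\vartheta\|\vartheta) &= \tilde\vartheta \ln\frac{\tilde\vartheta}{\vartheta} + (1-\tilde\vartheta) \ln\frac{1-\tilde\vartheta}{1-\vartheta} \ge \tilde\vartheta \ln\frac{\tilde\vartheta}{\vartheta} + (1-\tilde\vartheta) \ln(1-\tilde\vartheta) \\
&\ge \tilde\vartheta \ln\frac{\tilde\vartheta}{\vartheta} - \tilde\vartheta = \tilde\vartheta (\ln\tilde\vartheta - \ln\vartheta - 1).
\end{align}

(vi) This follows from $D(\vartheta\|\tilde\vartheta) \le -\ln(1-\tilde\vartheta) \le \frac{\tilde\vartheta}{1-\tilde\vartheta}$ together with the assumed
range of $\tilde\vartheta$. The last two statements (vii) and (viii) are even easier. □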

In the above entropy inequalities we have left out the extreme cases $\vartheta, \tilde\vartheta \in \{0, 1\}$.
This is for simplicity and convenience only. Inequalities (i) to (iv) remain valid for
$\vartheta, \tilde\vartheta \in \{0, 1\}$ if the fraction $\frac00$ is properly defined. However, since the extreme
cases will need to be considered separately anyway, there is no requirement for the
extension of the lemma. We won't need (vi) and (viii) of Lemma 2 in the sequel.

We want to point out that although we have proven Lemma 2 only for the case
of binary alphabet, generalizations to arbitrary alphabet are likely to hold. In fact,
(i) does hold for arbitrary alphabet, as shown in [Hut01].

It is a well-known fact that the binomial distribution may be approximated by 
a Gaussian. Our next goal is to establish upper and lower bounds for the binomial 
distribution. Again we leave out the extreme cases. 

Lemma 3 Let $\vartheta_0 \in (0, 1)$ be the true parameter, $n \ge 2$ and $1 \le k \le n - 1$, and
$\alpha = \frac{k}{n}$. Then the following assertions hold.

(i) $p(\alpha|n) \le \frac{1}{\sqrt{2\pi\alpha(1-\alpha)n}} \exp\left(-nD(\alpha\|\vartheta_0)\right)$,

(ii) $p(\alpha|n) \ge \frac{1}{\sqrt{8\alpha(1-\alpha)n}} \exp\left(-nD(\alpha\|\vartheta_0)\right)$.

The lemma gives a quantitative assertion about the Gaussian approximation to a
binomial distribution. The upper bound is sharp for $n \to \infty$ and fixed $\alpha$. Lemma 3
can be easily combined with Lemma 2, yielding Gaussian estimates for the Binomial
distribution.
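The two bounds of Lemma 3 can be compared against the exact binomial probabilities. The following sketch does this for all interior values of k; the helper names and the small relative tolerance (needed because the lower bound is attained with equality for n = 2) are assumptions of this illustration.

```python
import math

def kl(a, t):
    """Binary Kullback-Leibler divergence D(a||t) for a, t in (0, 1)."""
    return a * math.log(a / t) + (1 - a) * math.log((1 - a) / (1 - t))

def binomial_bounds_hold(n, theta0, rel_tol=1e-9):
    """Check Lemma 3 (i) and (ii) for every k with 1 <= k <= n - 1."""
    for k in range(1, n):
        a = k / n
        # exact probability of observing k ones in n trials
        p = math.comb(n, k) * theta0 ** k * (1 - theta0) ** (n - k)
        core = math.exp(-n * kl(a, theta0)) / math.sqrt(a * (1 - a) * n)
        upper = core / math.sqrt(2 * math.pi)  # bound (i)
        lower = core / math.sqrt(8)            # bound (ii)
        if not (lower * (1 - rel_tol) <= p <= upper * (1 + rel_tol)):
            return False
    return True

print(all(binomial_bounds_hold(n, t)
          for n in range(2, 60) for t in (0.1, 0.5, 0.9)))
```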

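Proof. Stirling's formula is a well-known result from calculus. In a refined version,
it states that for any $n \ge 1$ the factorial $n!$ can be bounded from below and above
by

\[ \sqrt{2\pi n} \cdot n^n \exp\left(-n + \frac{1}{12n+1}\right) \le n! \le \sqrt{2\pi n} \cdot n^n \exp\left(-n + \frac{1}{12n}\right). \]

Hence,

\begin{align}
p(\alpha|n) &= \frac{n!}{k!(n-k)!} \vartheta_0^k (1-\vartheta_0)^{n-k}
 \le \frac{\sqrt{n} \cdot n^n \exp\left(\frac{1}{12n}\right) \vartheta_0^k (1-\vartheta_0)^{n-k}}{\sqrt{2\pi k(n-k)} \cdot k^k (n-k)^{n-k} \exp\left(\frac{1}{12k+1} + \frac{1}{12(n-k)+1}\right)} \\
&= \frac{1}{\sqrt{2\pi\alpha(1-\alpha)n}} \exp\left(-n \cdot D(\alpha\|\vartheta_0) + \frac{1}{12n} - \frac{1}{12k+1} - \frac{1}{12(n-k)+1}\right) \\
&\le \frac{1}{\sqrt{2\pi\alpha(1-\alpha)n}} \exp\left(-nD(\alpha\|\vartheta_0)\right).
\end{align}

The last inequality is valid since $\frac{1}{12n} - \frac{1}{12k+1} - \frac{1}{12(n-k)+1} < 0$ for all $n$ and $k$, which
is easily verified using elementary computations. This establishes (i).

In order to show (ii), we observe

\begin{align}
p(\alpha|n) &\ge \frac{1}{\sqrt{2\pi\alpha(1-\alpha)n}} \exp\left(-n \cdot D(\alpha\|\vartheta_0) + \frac{1}{12n+1} - \frac{1}{12k} - \frac{1}{12(n-k)}\right) \\
&\ge \frac{\exp\left(\frac{1}{37} - \frac18\right)}{\sqrt{2\pi\alpha(1-\alpha)n}} \exp\left(-nD(\alpha\|\vartheta_0)\right) \quad\text{for } n \ge 3.
\end{align}

Here the last inequality follows from the fact that $\frac{1}{12n+1} - \frac{1}{12k} - \frac{1}{12(n-k)}$ is minimized
for $n = 3$ (and $k = 1$ or $2$), if we exclude $n = 2$, and $\exp(\frac{1}{37} - \frac18) \ge \sqrt{\pi}/2$. For $n = 2$
a direct computation establishes the lower bound. □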

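Lemma 4 Let $z \in \mathbb{R}^+$, then

(i) $\frac{\sqrt{\pi}}{2z^3} - \frac{1}{z\sqrt{2e}} \le \sum_{n=1}^\infty \sqrt{n} \exp(-z^2 n) \le \frac{\sqrt{\pi}}{2z^3} + \frac{1}{z\sqrt{2e}}$ and

(ii) $\sum_{n=1}^\infty n^{-\frac12} \exp(-z^2 n) \le \sqrt{\pi}/z$.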

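Proof. (i) The function $f(u) = \sqrt{u} \exp(-z^2 u)$ increases for $u \le \frac{1}{2z^2}$ and decreases
for $u \ge \frac{1}{2z^2}$. Let $N = \max\{n \in \mathbb{N} : f(n) \ge f(n-1)\}$, then it is easy to see that

\begin{align}
\sum_{n=1}^{N-1} f(n) &\le \int_0^N f(u)\,du \le \sum_{n=1}^{N} f(n) \quad\text{and} \\
\sum_{n=N+1}^{\infty} f(n) &\le \int_N^\infty f(u)\,du \le \sum_{n=N}^{\infty} f(n) \quad\text{and thus} \\
\sum_{n=1}^{\infty} f(n) - f(N) &\le \int_0^\infty f(u)\,du \le \sum_{n=1}^{\infty} f(n) + f(N)
\end{align}

holds. Moreover, $f$ is the derivative of the function

\[ F(u) = -\frac{\sqrt{u} \exp(-z^2 u)}{z^2} + \frac{1}{z^3} \int_0^{z\sqrt{u}} \exp(-v^2)\,dv. \]

Observe $f(N) \le f(\frac{1}{2z^2}) = \frac{1}{z\sqrt{2e}}$ and $\int_0^\infty \exp(-v^2)\,dv = \frac{\sqrt{\pi}}{2}$ to obtain the assertion.

(ii) The function $f(u) = u^{-\frac12} \exp(-z^2 u)$ decreases monotonically on $(0, \infty)$ and is
the derivative of $F(u) = 2z^{-1} \int_0^{z\sqrt{u}} \exp(-v^2)\,dv$. Therefore,

\[ \sum_{n=1}^{\infty} f(n) \le \int_0^\infty f(u)\,du = \sqrt{\pi}/z \]

holds. □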
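The series bounds of Lemma 4 can be checked by truncated summation. In the sketch below the truncation point (where the exponent $z^2 n$ reaches 60, so that the neglected tail is far below floating-point resolution) and the function names are assumptions of this illustration.

```python
import math

def truncated_series(power, z, terms):
    """Partial sum of n**power * exp(-z^2 n) for n = 1..terms."""
    return sum(n ** power * math.exp(-z * z * n) for n in range(1, terms + 1))

def lemma4_holds(z):
    """Check Lemma 4 (i) and (ii) for a given z > 0 using truncated sums."""
    terms = int(60 / (z * z)) + 10  # beyond this, exp(-z^2 n) < exp(-60)
    s_half = truncated_series(0.5, z, terms)    # series of assertion (i)
    s_minus = truncated_series(-0.5, z, terms)  # series of assertion (ii)
    centre = math.sqrt(math.pi) / (2 * z ** 3)
    slack = 1 / (z * math.sqrt(2 * math.e))
    ok_i = centre - slack <= s_half <= centre + slack
    ok_ii = s_minus <= math.sqrt(math.pi) / z
    return ok_i and ok_ii

print(all(lemma4_holds(z) for z in (0.05, 0.3, 1.0, 2.0)))
```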
3 Lower Bound

We are now in the position to prove that even for Bernoulli classes the upper bound
from Theorem 1 is sharp in general.

Proposition 5 Let $\vartheta_0 = \frac12$ be the true parameter generating sequences of fair coin
flips. Assume $\Theta = \{\vartheta_0, \vartheta_1, \ldots, \vartheta_{2^N-1}\}$ where $\vartheta_k = \frac12 + 2^{-k-1}$ for $k \ge 1$. Let all
complexities be equal, i.e. $Kw(\vartheta_0) = \ldots = Kw(\vartheta_{2^N-1}) = N$. Then

\[ \sum_{n=1}^\infty \mathbf{E}(\vartheta_0 - \hat\vartheta^x)^2 \ge \tfrac{1}{84}(2^N - 5) \stackrel{\times}{=} 2^{Kw(\vartheta_0)}. \]



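Proof. Recall that $\hat\vartheta^x = \hat\vartheta^{(\alpha,n)}$, the maximizing element for some observed sequence
$x$, only depends on the length $n$ and the observed fraction of ones $\alpha$. In order
to obtain an estimate for the total prediction error $\sum_n \mathbf{E}(\vartheta_0 - \hat\vartheta^x)^2$, partition the
interval $[0, 1]$ into $2^N$ disjoint intervals $I_k$, such that $\bigcup_{k=0}^{2^N-1} I_k = [0, 1]$. Then consider
the contributions for the observed fraction $\alpha$ falling in $I_k$ separately:

\[ C(k) = \sum_{n=1}^\infty \sum_{\alpha\in A_n\cap I_k} p(\alpha|n) (\hat\vartheta^{(\alpha,n)} - \vartheta_0)^2 \tag{6} \]

(compare (5)). Clearly, $\sum_n \mathbf{E}(\vartheta_0 - \hat\vartheta^x)^2 = \sum_k C(k)$ holds. We define the partitioning
$(I_k)$ as $I_0 = [0, \frac12 + 2^{-2^N}) = [0, \vartheta_{2^N-1})$, $I_1 = [\frac34, 1] = [\vartheta_1, 1]$, and
$I_k = [\vartheta_k, \vartheta_{k-1})$ for all $2 \le k \le 2^N - 1$.
Fix $k \in \{2, \ldots, 2^N - 1\}$ and assume $\alpha \in I_k$. Then

\[ \hat\vartheta^{(\alpha,n)} = \arg\min_\vartheta \{ nD(\alpha\|\vartheta) + Kw(\vartheta) \ln 2 \} = \arg\min_\vartheta \{ nD(\alpha\|\vartheta) \} \in \{\vartheta_k, \vartheta_{k-1}\} \]

according to (1). So clearly $(\hat\vartheta^{(\alpha,n)} - \vartheta_0)^2 \ge (\vartheta_k - \vartheta_0)^2 = 2^{-2k-2}$ holds. Since $p(\alpha|n)$
decreases for increasing $|\alpha - \vartheta_0|$, we have $p(\alpha|n) \ge p(\vartheta_{k-1}|n)$. The interval $I_k$ has
length $2^{-k-1}$, so there are at least $\lfloor n2^{-k-1} \rfloor \ge n2^{-k-1} - 1$ observed fractions $\alpha$ falling
in the interval. From (6), the total contribution of $\alpha \in I_k$ can be estimated by

\[ C(k) \ge \sum_{n=1}^\infty 2^{-2k-2} (n2^{-k-1} - 1)\, p(\vartheta_{k-1}|n). \]

Note that the terms in the sum even become negative for small $n$, which does not
cause any problems. We proceed with

\[ p(\vartheta_{k-1}|n) \ge \frac{1}{\sqrt{8 \cdot 2^{-2} n}} \exp\left[-nD(\tfrac12 + 2^{-k}\|\tfrac12)\right] \ge \frac{1}{\sqrt{2n}} \exp\left[-n \tfrac83 2^{-2k}\right] \]

according to Lemma 3 and Lemma 2 (ii). By Lemma 4 (i) and (ii), we have

\begin{align}
\sum_{n=1}^\infty \sqrt{n} \exp\left[-n \tfrac83 2^{-2k}\right] &\ge \frac{\sqrt{\pi}}{2} \left(\frac38\right)^{\frac32} 2^{3k} - \frac{1}{\sqrt{2e}} \sqrt{\frac38}\, 2^k \quad\text{and} \\
\sum_{n=1}^\infty n^{-\frac12} \exp\left[-n \tfrac83 2^{-2k}\right] &\le \sqrt{\pi} \sqrt{\frac38}\, 2^k.
\end{align}

Considering only $k \ge 5$, we thus obtain

\begin{align}
C(k) &\ge \frac{\sqrt3}{16} \left[ \frac{3\sqrt{\pi}}{32} - \frac{1}{\sqrt{2e}} 2^{-2k-1} - \sqrt{\pi}\, 2^{-k} \right] \\
&\ge \frac{\sqrt3}{16} \left[ \frac{3\sqrt{\pi}}{32} - \sqrt{\pi}\, 2^{-5} \right] - \frac{\sqrt3}{16\sqrt{2e}} 2^{-11} > \frac{1}{84}.
\end{align}

Ignoring the contributions for $k \le 4$, this implies the assertion. □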
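To get a feeling for the class used in Proposition 5, one can evaluate the partial sums of the cumulative loss exactly for small N. The sketch below is such an illustration, with the function names and the cut-off `n_max` chosen here as assumptions. Note that a truncated sum is only a lower bound on the infinite series, and since the parameters closest to 1/2 can only be told apart for very large n, a small `n_max` captures only part of the total loss.

```python
import math

def kl(a, t):
    """Binary KL divergence D(a||t) for a in [0, 1] and t in (0, 1)."""
    total = 0.0
    if a > 0:
        total += a * math.log(a / t)
    if a < 1:
        total += (1 - a) * math.log((1 - a) / (1 - t))
    return total

def partial_cumulative_loss(N, n_max):
    """Sum over n <= n_max of E(theta_0 - estimate)^2 for the class of Prop. 5."""
    thetas = [0.5] + [0.5 + 2.0 ** (-k - 1) for k in range(1, 2 ** N)]
    total = 0.0
    for n in range(1, n_max + 1):
        for ones in range(n + 1):
            alpha = ones / n
            # All complexities are equal, so MDL reduces to maximum
            # likelihood within the class: minimize D(alpha||theta).
            best = min(thetas, key=lambda t: kl(alpha, t))
            prob = math.comb(n, ones) * 0.5 ** n  # fair-coin probability
            total += prob * (best - 0.5) ** 2
    return total

for n_max in (30, 60):
    print(n_max, partial_cumulative_loss(2, n_max))
```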



This result shows that if the parameters and their weights are chosen in an
appropriate way, then the total expected error is of order $w_0^{-1}$ instead of $\ln w_0^{-1}$.
Interestingly, this outcome seems to depend on the arrangement and the weights
of the false parameters rather than on the weight of the true one. One can check
with moderate effort that the proposition still remains valid if e.g. $w_0$ is twice as
large as the other weights. Actually, the proof of Proposition 5 shows even a slightly
more general result, namely admitting additional arbitrary parameters with larger
complexities:



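Corollary 6 Let $\Theta = \{\vartheta_k : k \ge 0\}$, $\vartheta_0 = \frac12$, $\vartheta_k = \frac12 + 2^{-k-1}$ for $1 \le k \le 2^N - 2$,
and $\vartheta_k \in [0, 1]$ arbitrary for $k \ge 2^N - 1$. Let $Kw(\vartheta_k) = N$ for $0 \le k \le 2^N - 2$ and
$Kw(\vartheta_k) > N$ for $k \ge 2^N - 1$. Then $\sum_n \mathbf{E}(\vartheta_0 - \hat\vartheta^x)^2 \ge \frac{1}{84}(2^N - 6)$ holds.

We will use this result only for a later example. Other and more general assertions
can be proven similarly.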



4 Upper Bounds 

Although the cumulative error may be large, as seen in the previous section, the
instantaneous error is always small. It is easy to demonstrate this for the Bernoulli
case, to which we restrict in this paper. Much more general results have been
obtained for arbitrary classes of i.i.d. models [Li99, Zha04]. Strong instantaneous
bounds hold in particular if MDL is modified by replacing the factor $\ln 2$ in (1) by
something larger (e.g. $(1 + \varepsilon) \ln 2$) such that complexity is penalized slightly more
than usually. Note that our cumulative bounds are incomparable to these and other
instantaneous bounds.
Proposition 7 For $n \ge 3$, the expected instantaneous square loss is bounded as
follows:

\[ \mathbf{E}(\vartheta_0 - \hat\vartheta^{x_{1:n}})^2 \le \frac{(\ln 2) Kw(\vartheta_0)}{2n} + \frac{\sqrt{2 (\ln 2) Kw(\vartheta_0) \ln n}}{n} + \frac{6 \ln n}{n}. \]

Proof. We give an elementary proof for the case $\vartheta_0 \in (\frac14, \frac34)$ only. Like in the
proof of Proposition 5, we consider the contributions of different $\alpha$ separately. By
Hoeffding's inequality, $P(|\alpha - \vartheta_0| \ge \frac{c}{\sqrt{n}}) \le 2e^{-2c^2}$ for any $c > 0$. Letting $c = \sqrt{\ln n}$,
the contributions by these $\alpha$ are thus bounded by $\frac{2}{n^2} \le \frac{\ln n}{n}$.

On the other hand, for $|\alpha - \vartheta_0| \le \frac{c}{\sqrt{n}}$, recall that $\vartheta_0$ beats any $\vartheta$ iff (3) holds.
According to $Kw(\vartheta) \ge 0$, $|\alpha - \vartheta_0| \le \frac{c}{\sqrt{n}}$, and Lemma 2 (i) and (ii), (3) is already
implied by

\[ |\alpha - \vartheta| \ge \sqrt{\frac{\frac12 (\ln 2) Kw(\vartheta_0) + \frac43 c^2}{n}}. \]

Clearly, a contribution only occurs if $\vartheta$ beats $\vartheta_0$, therefore if the opposite inequality
holds. Using $|\alpha - \vartheta_0| \le \frac{c}{\sqrt{n}}$ again and the triangle inequality, we obtain that

\[ (\vartheta - \vartheta_0)^2 \le \frac{5c^2 + \frac12 (\ln 2) Kw(\vartheta_0) + \sqrt{2 (\ln 2) Kw(\vartheta_0) c^2}}{n} \]

in this case. Since we have chosen $c = \sqrt{\ln n}$, this implies the assertion. □
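The bound of Proposition 7, $\frac{(\ln 2)Kw(\vartheta_0)}{2n} + \frac{\sqrt{2(\ln 2)Kw(\vartheta_0)\ln n}}{n} + \frac{6\ln n}{n}$, can be compared with the exact expected instantaneous loss on a small example. In the sketch below, the class of multiples of 1/8 with equal complexity 3, the true parameter 3/8, and all function names are assumptions made purely for illustration.

```python
import math

def kl(a, t):
    """Binary KL divergence D(a||t) for a in [0, 1] and t in (0, 1)."""
    total = 0.0
    if a > 0:
        total += a * math.log(a / t)
    if a < 1:
        total += (1 - a) * math.log((1 - a) / (1 - t))
    return total

def mdl_estimate(ones, n, complexities):
    """Minimize n*D(alpha||theta) + Kw(theta)*ln 2 over the class."""
    alpha = ones / n
    return min(complexities,
               key=lambda t: n * kl(alpha, t) + complexities[t] * math.log(2))

def expected_instantaneous_loss(n, theta0, complexities):
    """Exact E(theta0 - estimate)^2 after n observations."""
    loss = 0.0
    for ones in range(n + 1):
        p = math.comb(n, ones) * theta0 ** ones * (1 - theta0) ** (n - ones)
        loss += p * (mdl_estimate(ones, n, complexities) - theta0) ** 2
    return loss

def proposition7_bound(n, kw0):
    """Right-hand side of the instantaneous loss bound (valid for n >= 3)."""
    c = math.log(2) * kw0
    return c / (2 * n) + math.sqrt(2 * c * math.log(n)) / n + 6 * math.log(n) / n

# Hypothetical class: multiples of 1/8 in (0, 1), all with complexity 3.
example_class = {i / 8: 3.0 for i in range(1, 8)}
for n in (3, 10, 50, 200):
    print(n, expected_instantaneous_loss(n, 0.375, example_class),
          proposition7_bound(n, 3.0))
```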

One can improve the bound in Proposition 7 to $\mathbf{E}(\vartheta_0 - \hat\vartheta^{x_{1:n}})^2 \stackrel{\times}{\le} \frac{Kw(\vartheta_0)}{n}$ by a
refined argument, compare [BC91]. But the high-level assertion is the same: Even
if the cumulative upper bound may be infinite, the instantaneous error converges
rapidly to 0. Moreover, the convergence speed depends on $Kw(\vartheta_0)$ as opposed to
$2^{Kw(\vartheta_0)}$. Thus $\hat\vartheta$ tends to $\vartheta_0$ rapidly in probability (recall that the assertion is not
strong enough to conclude almost sure convergence). The proof does not exploit
$\sum_\vartheta w_\vartheta \le 1$, only $w_\vartheta \le 1$, hence the assertion even holds for a maximum likelihood
estimator (i.e. $w_\vartheta = 1$ for all $\vartheta \in \Theta$). The theorem generalizes to i.i.d. classes. For
the example in Proposition 5, the instantaneous bound implies that the bulk of
losses occurs very late. This does not hold for general (non-i.i.d.) model classes:
The total loss up to time $n$ in [PH04a, Example 9] grows linearly in $n$.

We will now state our main positive result that upper bounds the cumulative
loss in terms of the negative logarithm of the true weight and the arrangement of
the false parameters. The proof is similar to that of Proposition 5. We will only
give the proof idea here and defer the lengthy and tedious technical details to the
appendix.

Consider the cumulated sum square error $\sum_n \mathbf{E}(\hat\vartheta^{(\alpha,n)} - \vartheta_0)^2$. In order to upper
bound this quantity, we will partition the open unit interval $(0, 1)$ into a sequence of
intervals $(I_k)_{k=1}^\infty$, each of measure $2^{-k}$. (More precisely: Each $I_k$ is either an interval
or a union of two intervals.) Then we will estimate the contribution of each interval
to the cumulated square error,

\[ C(k) = \sum_{n=1}^\infty \sum_{\alpha\in A_n,\ \hat\vartheta^{(\alpha,n)}\in I_k} p(\alpha|n) (\hat\vartheta^{(\alpha,n)} - \vartheta_0)^2 \]
Figure 1: Example of the first four intervals for $\vartheta_0 = \frac{3}{16}$. We have an l-step, a c-step,
an l-step and another c-step. All following steps will be also c-steps.



(compare (5) and (6)). Note that $\hat\vartheta^{(\alpha,n)} \in I_k$ precisely reads $\hat\vartheta^{(\alpha,n)} \in I_k \cap \Theta$, but for
convenience we generally assume $\vartheta \in \Theta$ for all $\vartheta$ being considered. This partitioning
is also used for $\alpha$, i.e. define the contribution $C(k, j)$ of $\hat\vartheta \in I_k$ where $\alpha \in I_j$ as

\[ C(k, j) = \sum_{n=1}^\infty \sum_{\alpha\in A_n\cap I_j,\ \hat\vartheta^{(\alpha,n)}\in I_k} p(\alpha|n) (\hat\vartheta^{(\alpha,n)} - \vartheta_0)^2. \]

We need to distinguish between $\alpha$ that are located close to $\vartheta_0$ and $\alpha$ that are located
far from $\vartheta_0$. "Close" will be roughly equivalent to $j > k$, "far" will be approximately
$j \le k$. So we get $\sum_n \mathbf{E}(\hat\vartheta^{x_{1:n}} - \vartheta_0)^2 = \sum_{k=1}^\infty C(k) = \sum_k \sum_j C(k, j)$. In the proof,

\[ p(\alpha|n) \stackrel{\times}{\le} [n\alpha(1-\alpha)]^{-\frac12} \exp[-nD(\alpha\|\vartheta_0)] \]

is often applied, which holds by Lemma 3 (recall that $f \stackrel{\times}{\le} g$ stands for $f = O(g)$).
Terms like $D(\alpha\|\vartheta_0)$, arising in this context and others, can be further estimated
using Lemma 2. We now give the constructions of intervals $I_k$ and complementary
intervals $J_k$.

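Definition 8 Let $\vartheta_0 \in \Theta$ be given. Start with $J_0 = [0, 1)$. Let $J_{k-1} = [\vartheta^l_k, \vartheta^r_k)$
and define $d_k = \vartheta^r_k - \vartheta^l_k = 2^{-k+1}$. Then $I_k, J_k \subset J_{k-1}$ are constructed from $J_{k-1}$
according to the following rules.

\begin{align}
\vartheta_0 \in [\vartheta^l_k, \vartheta^l_k + \tfrac38 d_k) &\Rightarrow J_k = [\vartheta^l_k, \vartheta^l_k + \tfrac12 d_k),\ I_k = [\vartheta^l_k + \tfrac12 d_k, \vartheta^r_k), \tag{7} \\
\vartheta_0 \in [\vartheta^l_k + \tfrac38 d_k, \vartheta^l_k + \tfrac58 d_k) &\Rightarrow J_k = [\vartheta^l_k + \tfrac14 d_k, \vartheta^l_k + \tfrac34 d_k), \tag{8} \\
&\phantom{\Rightarrow}\ I_k = [\vartheta^l_k, \vartheta^l_k + \tfrac14 d_k) \cup [\vartheta^l_k + \tfrac34 d_k, \vartheta^r_k), \\
\vartheta_0 \in [\vartheta^l_k + \tfrac58 d_k, \vartheta^r_k) &\Rightarrow J_k = [\vartheta^l_k + \tfrac12 d_k, \vartheta^r_k),\ I_k = [\vartheta^l_k, \vartheta^l_k + \tfrac12 d_k). \tag{9}
\end{align}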
We call the kth step of the interval construction an l-step if (7) applies, a c-step if
(8) applies, and an r-step if (9) applies, respectively. Fig. 1 shows an example for
the interval construction.

Clearly, this is not the only possible way to define an interval construction.
Maybe the reader wonders why we did not center the intervals around $\vartheta_0$. In fact,
this construction would equally work for the proof. However, its definition would
not be easier, since one still has to treat the case where $\vartheta_0$ is located close to the
boundary. Moreover, our construction has the nice property that the interval bounds
are finite binary fractions.
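The interval construction can be traced with exact rational arithmetic. The sketch below assumes the reading of Definition 8 in which the current interval $J_{k-1}$ of length $d_k$ is split at $\frac38 d_k$ and $\frac58 d_k$ from its left end, and the next interval $J_k$ is the left half (l-step), the central half (c-step) or the right half (r-step). The function name and the return format are illustrative choices.

```python
from fractions import Fraction

def interval_steps(theta0, num_steps):
    """Return the step types and the final interval J_k after num_steps steps."""
    left, right = Fraction(0), Fraction(1)  # J_0 = [0, 1)
    steps = []
    for _ in range(num_steps):
        d = right - left
        if theta0 < left + Fraction(3, 8) * d:
            # l-step: theta0 lies in the left part, keep the left half
            steps.append("l")
            right = left + d / 2
        elif theta0 < left + Fraction(5, 8) * d:
            # c-step: theta0 lies in the centre, keep the central half
            steps.append("c")
            left, right = left + d / 4, left + 3 * d / 4
        else:
            # r-step: theta0 lies in the right part, keep the right half
            steps.append("r")
            left = left + d / 2
    return steps, (left, right)

steps, final_interval = interval_steps(Fraction(3, 16), 6)
print("".join(steps), final_interval)
```

For the true parameter 3/16 this reproduces the step sequence described in the caption of Figure 1: an l-step, a c-step, an l-step, and c-steps from then on.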

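Given the interval construction, we can identify the $\vartheta \in I_k$ with lowest complexity:

Definition 9 For $\vartheta_0 \in \Theta$ and the interval construction $(I_k, J_k)$, let

\begin{align}
\vartheta^I_k &= \arg\min\{Kw(\vartheta) : \vartheta \in I_k \cap \Theta\}, \\
\vartheta^J_k &= \arg\min\{Kw(\vartheta) : \vartheta \in J_k \cap \Theta\}, \text{ and} \\
\Delta(k) &= \max\{Kw(\vartheta^I_k) - Kw(\vartheta^J_k), 0\}.
\end{align}

If there is no $\vartheta \in I_k \cap \Theta$, we set $\Delta(k) = Kw(\vartheta^I_k) = \infty$.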

We can now state the main positive result of this paper. The detailed proof is
deferred to the appendix. Corollaries will be given in the next section.

Theorem 10 Let $\Theta \subset [0, 1]$ be countable, $\vartheta_0 \in \Theta$, and $w_\vartheta = 2^{-Kw(\vartheta)}$, where $Kw(\vartheta)$
is some complexity measure on $\Theta$. Let $\Delta(k)$ be as introduced in Definition 9 and
recall that $\hat\vartheta^x = \hat\vartheta^{(\alpha,n)}$ depends on $x$'s length and observed fraction of ones. Then

\[ \sum_{n=1}^\infty \mathbf{E}(\vartheta_0 - \hat\vartheta^x)^2 \stackrel{\times}{\le} Kw(\vartheta_0) + \sum_{k=1}^\infty 2^{-\Delta(k)} \sqrt{\Delta(k)}. \]

5 Uniformly Distributed Weights 

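We are now able to state some positive results following from Theorem 10.

Theorem 11 Let $\Theta \subset [0, 1]$ be a countable class of parameters and $\vartheta_0 \in \Theta$ the true
parameter. Assume that there are constants $a \ge 1$ and $b \ge 0$ such that

\[ \min\{Kw(\vartheta) : \vartheta \in [\vartheta_0 - 2^{-k}, \vartheta_0 + 2^{-k}] \cap \Theta,\ \vartheta \ne \vartheta_0\} \ge \frac{k - b}{a} \tag{10} \]

holds for all $k > aKw(\vartheta_0) + b$. Then we have

\[ \sum_{n=1}^\infty \mathbf{E}(\vartheta_0 - \hat\vartheta^x)^2 \stackrel{\times}{\le} aKw(\vartheta_0) + b \stackrel{\times}{\le} Kw(\vartheta_0). \]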
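Proof. We have to show that

\[ \sum_{k=1}^\infty 2^{-\Delta(k)} \sqrt{\Delta(k)} \stackrel{\times}{\le} aKw(\vartheta_0) + b, \]

then the assertion follows from Theorem 10. Let $k_1 = \lceil aKw(\vartheta_0) + b + 1 \rceil$ and
$k' = k - k_1$. Then by Lemma 17 (iii) and (10) we have

\begin{align}
\sum_{k=1}^\infty 2^{-\Delta(k)} \sqrt{\Delta(k)} &\le \sum_{k=1}^{k_1} 1 + \sum_{k=k_1+1}^\infty 2^{-\Delta(k)} \sqrt{\Delta(k)} \\
&\stackrel{\times}{\le} k_1 + 2^{Kw(\vartheta_0)} \sum_{k=k_1+1}^\infty 2^{-\frac{k-b}{a}} \sqrt{\frac{k-b}{a}} \\
&= k_1 + 2^{Kw(\vartheta_0)} \sum_{k'=1}^\infty 2^{-\frac{k'+k_1-b}{a}} \sqrt{\frac{k'+k_1-b}{a}} \\
&\stackrel{\times}{\le} aKw(\vartheta_0) + b + 2 + \sum_{k'=1}^\infty 2^{-\frac{k'}{a}} \sqrt{\frac{k'}{a} + Kw(\vartheta_0)}.
\end{align}

As already seen in the proof of Theorem 10, $\sqrt{\frac{k'}{a} + Kw(\vartheta_0)} \le \sqrt{\frac{k'}{a}} + \sqrt{Kw(\vartheta_0)}$,
$\sum_{k'} 2^{-\frac{k'}{a}} \stackrel{\times}{\le} a$, and $\sum_{k'} 2^{-\frac{k'}{a}} \sqrt{\frac{k'}{a}} \stackrel{\times}{\le} a$ hold. The latter is by Lemma 4 (i). This
implies the assertion. □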

Letting $j = \frac{k-b}{a}$, (10) asserts that parameters $\vartheta$ with complexity $Kw(\vartheta) = j$
must have a minimum distance of $2^{-ja-b}$ from $\vartheta_0$. That is, if parameters with equal
weights are (approximately) uniformly distributed in the neighborhood of $\vartheta_0$, in the
sense that they are not too close to each other, then fast convergence holds. The
next two results are special cases based on the set of all finite binary fractions,

\[ \mathbb{Q}_{\mathbb{B}^*} = \{\vartheta = 0.\beta_1\beta_2 \ldots \beta_{n-1}1 : n \in \mathbb{N},\ \beta_i \in \mathbb{B}\} \cup \{0, 1\}. \]

If $\vartheta = 0.\beta_1\beta_2 \ldots \beta_{n-1}1 \in \mathbb{Q}_{\mathbb{B}^*}$, its length is $l(\vartheta) = n$. Moreover, there is a binary code
$\beta'_1 \ldots \beta'_{n'}$ for $n$, having at most $n' \le \lfloor \log_2(n+1) \rfloor$ bits. Then
$0\beta'_1 0\beta'_2 \ldots 0\beta'_{n'} 1 \beta_1 \ldots \beta_{n-1}$ is a prefix-code for $\vartheta$. For completeness, we can define
the codes for $\vartheta = 0, 1$ to be 10 and 11, respectively. So we may define a complexity
measure on $\mathbb{Q}_{\mathbb{B}^*}$ by

\[ Kw(0) = 2,\ Kw(1) = 2,\ \text{and}\ Kw(\vartheta) = l(\vartheta) + 2\lfloor \log_2(l(\vartheta) + 1) \rfloor \text{ for } \vartheta \ne 0, 1. \tag{11} \]

There are other similar simple prefix codes on $\mathbb{Q}_{\mathbb{B}^*}$ with the property $Kw(\vartheta) \ge l(\vartheta)$.
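The prefix code behind (11) can be written out explicitly. The sketch below represents a finite binary fraction by its digit string after the binary point (ending in 1). The particular code for the length n, namely the binary expansion of n + 1 without its leading 1, is one possible choice with exactly $\lfloor \log_2(n+1) \rfloor$ bits and is an assumption of this illustration, as are all function names.

```python
import itertools
import math

CODE_ZERO, CODE_ONE = "10", "11"  # codes for the parameters 0 and 1

def code_for_length(n):
    """Binary code for n >= 1 with floor(log2(n + 1)) bits (assumed choice)."""
    return bin(n + 1)[3:]  # strip the '0b' prefix and the leading 1

def encode(digits):
    """Prefix code for 0.digits, where digits ends in '1'."""
    n = len(digits)
    header = "".join("0" + bit for bit in code_for_length(n))
    return header + "1" + digits[:-1]  # the final digit 1 is implicit

def all_fractions(max_len):
    """All digit strings of length <= max_len that end in '1'."""
    for n in range(1, max_len + 1):
        for prefix in itertools.product("01", repeat=n - 1):
            yield "".join(prefix) + "1"

def is_prefix_free(codes):
    """In sorted order, any prefix relation shows up between neighbours."""
    codes = sorted(codes)
    return all(not b.startswith(a) for a, b in zip(codes, codes[1:]))

codes = [CODE_ZERO, CODE_ONE] + [encode(d) for d in all_fractions(8)]
print(is_prefix_free(codes), sum(2.0 ** -len(c) for c in codes))
```

The code length of `encode(digits)` equals $l + 2\lfloor \log_2(l+1) \rfloor$ with $l$ the number of digits, which matches (11), and the Kraft sum over any finite subset stays below one.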

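Corollary 12 Let $\Theta = \mathbb{Q}_{\mathbb{B}^*}$, $\vartheta_0 \in \Theta$ and $Kw(\vartheta) \ge l(\vartheta)$ for all $\vartheta \in \Theta$, and recall
$\hat\vartheta^x = \hat\vartheta^{(\alpha,n)}$. Then $\sum_n \mathbf{E}(\vartheta_0 - \hat\vartheta^x)^2 \stackrel{\times}{\le} Kw(\vartheta_0)$.

Proof. Condition (10) holds with $a = 1$ and $b = 0$. □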

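This is a special case of a uniform distribution of parameters with equal complexities.
The next corollary is more general; it proves fast convergence if the uniform
distribution is distorted by some function $\varphi$.

Corollary 13 Let $\varphi : [0, 1] \to [0, 1]$ be an injective, $N$ times continuously
differentiable function. Let $\Theta = \varphi(\mathbb{Q}_{\mathbb{B}^*})$, $Kw(\varphi(t)) \ge l(t)$ for all $t \in \mathbb{Q}_{\mathbb{B}^*}$, and $\vartheta_0 = \varphi(t_0)$
for a $t_0 \in \mathbb{Q}_{\mathbb{B}^*}$. Assume that there is $n \le N$ and $\varepsilon > 0$ such that

\begin{align}
\left| \frac{d^n\varphi}{dt^n}(t) \right| &\ge c > 0 \text{ for all } t \in [t_0 - \varepsilon, t_0 + \varepsilon] \text{ and} \\
\frac{d^m\varphi}{dt^m}(t_0) &= 0 \text{ for all } 1 \le m < n.
\end{align}

Then we have

\[ \sum_n \mathbf{E}(\vartheta_0 - \hat\vartheta^x)^2 \stackrel{\times}{\le} nKw(\vartheta_0) + 2\log_2(n!) - 2\log_2 c + n\log_2 \varepsilon \stackrel{\times}{\le} nKw(\vartheta_0). \]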

Proof. Fix $j > Kw(\vartheta_0)$, then

$$Kw(\varphi(t)) \ge j \ \text{ for all } t \in [t_0 - 2^{-j}, t_0 + 2^{-j}] \cap \mathbb{Q}_{\mathbb{B}^*}. \qquad (12)$$

Moreover, for all $t \in [t_0 - 2^{-j}, t_0 + 2^{-j}]$, Taylor's theorem asserts that

$$\varphi(t) = \varphi(t_0) + \frac{\frac{d^n\varphi}{dt^n}(\tilde t)}{n!}(t - t_0)^n \qquad (13)$$

for some $\tilde t$ in $(t_0, t)$ (or $(t, t_0)$ if $t < t_0$). We request in addition $2^{-j} \le \varepsilon$, then $\left|\frac{d^n\varphi}{dt^n}(\tilde t)\right| \ge c$ by assumption. Apply (13) to $t = t_0 + 2^{-j}$ and $t = t_0 - 2^{-j}$ and define $k = \lceil jn + \log_2(n!) - \log_2 c\rceil$ in order to obtain $|\varphi(t_0 + 2^{-j}) - \vartheta_0| \ge 2^{-k}$ and $|\varphi(t_0 - 2^{-j}) - \vartheta_0| \ge 2^{-k}$. By injectivity of $\varphi$, we see that $\varphi(t) \notin [\vartheta_0 - 2^{-k}, \vartheta_0 + 2^{-k}]$ if $t \notin [t_0 - 2^{-j}, t_0 + 2^{-j}]$. Together with (12), this implies

$$Kw(\vartheta) \ge j \ge \frac{k - \log_2(n!) + \log_2 c - 1}{n} \ \text{ for all } \vartheta \in [\vartheta_0 - 2^{-k}, \vartheta_0 + 2^{-k}] \cap \Theta.$$

This is condition (10) with $a = n$ and $b = \log_2(n!) - \log_2 c + 1$. Finally, the assumption $2^{-j} \le \varepsilon$ holds if $k \ge k_1 = n\log_2\varepsilon + \log_2(n!) - \log_2 c + 1$. This gives an additional contribution to the error of at most $k_1$. □
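The separation step of this proof can be illustrated on a concrete map. The sketch below assumes the hypothetical example $\varphi(t) = \frac12 + (t - \frac12)^3$ with $t_0 = \frac12$, which is injective on $[0,1]$ and has $\varphi'(t_0) = \varphi''(t_0) = 0$ and third derivative equal to 6 everywhere, so that $n = 3$ and $c = 6$. It checks, in exact rational arithmetic, that the images of $t_0 \pm 2^{-j}$ are at least $2^{-k}$ away from $\vartheta_0$ for $k = \lceil jn + \log_2(n!) - \log_2 c\rceil$.

```python
import math
from fractions import Fraction

# Hypothetical example map, not taken from the paper.
T0 = Fraction(1, 2)
N_ORDER = 3      # first non-vanishing derivative at T0
C_BOUND = 6      # lower bound on |phi'''| (here phi''' = 6 everywhere)


def phi(t: Fraction) -> Fraction:
    """phi(t) = 1/2 + (t - 1/2)^3, an injective smooth map on [0, 1]."""
    return Fraction(1, 2) + (t - T0) ** 3


def separation_exponent(j: int) -> int:
    """k = ceil(j*n + log2(n!) - log2(c)) as defined in the proof."""
    offset = math.log2(math.factorial(N_ORDER)) - math.log2(C_BOUND)
    return math.ceil(j * N_ORDER + offset)


def separated(j: int) -> bool:
    """Check |phi(t0 +- 2^-j) - theta0| >= 2^-k exactly."""
    k = separation_exponent(j)
    h = Fraction(1, 2 ** j)
    theta0 = phi(T0)
    return all(abs(phi(t) - theta0) >= Fraction(1, 2 ** k)
               for t in (T0 + h, T0 - h))


if __name__ == "__main__":
    for j in range(1, 8):
        print(j, separation_exponent(j), separated(j))
```

For this particular map the two sides coincide, which shows that the exponent $k$ cannot be chosen smaller in general.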

Corollary 13 shows an implication of Theorem 11 for parameter identification: A class of models is given by a set of parameters $\mathbb{Q}_{\mathbb{B}^*}$ and a mapping $\varphi : \mathbb{Q}_{\mathbb{B}^*} \to \Theta$. The task is to identify the true parameter $t_0$ or its image $\vartheta_0 = \varphi(t_0)$. The injectivity of $\varphi$ is not necessary for fast convergence, but it facilitates the proof. The assumptions of Corollary 13 are satisfied if $\varphi$ is, for example, a polynomial. In fact, it should be possible to prove fast convergence of MDL for many common parameter identification problems. For sets of parameters other than $\mathbb{Q}_{\mathbb{B}^*}$, e.g. the set of all rational numbers $\mathbb{Q}$, similar corollaries can easily be proven.

How large is the constant hidden in "$\stackrel{\times}{\le}$"? When examining carefully the proof of Theorem 10, the resulting constant is quite huge. This is mainly due to the frequent "wasting" of small constants. The sharp bound is presumably small, perhaps 16. On the other hand, for the actual true expectation (as opposed to its upper bound) and complexities as in (11), numerical simulations show $\sum_n \mathbf{E}(\vartheta_0 - \vartheta^x)^2 \le \frac12 Kw(\vartheta_0)$.
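A numerical experiment of this kind could be set up roughly as follows. This is only a sketch under stated assumptions, not the simulation used for the claim above: the class is truncated to binary fractions of length at most `max_len` (plus 0 and 1), the sum over $n$ is truncated at `horizon`, the MDL estimator is taken to be the maximizer of $2^{-Kw(\vartheta)}p_\vartheta(x)$, and the expectation is computed exactly from the binomial distribution. All function names are hypothetical.

```python
import math


def kw(length: int) -> int:
    """Complexity (11) of a binary fraction of the given length."""
    return length + 2 * ((length + 1).bit_length() - 1)


def model_class(max_len: int):
    """Pairs (theta, Kw(theta)) for 0, 1 and all fractions odd / 2^l, l <= max_len."""
    models = [(0.0, 2), (1.0, 2)]
    for l in range(1, max_len + 1):
        for odd in range(1, 2 ** l, 2):
            models.append((odd / 2 ** l, kw(l)))
    return models


def log_lik(theta: float, ones: int, zeros: int) -> float:
    """Log-likelihood of a sequence with the given counts under Bernoulli(theta)."""
    if theta == 0.0:
        return 0.0 if ones == 0 else -math.inf
    if theta == 1.0:
        return 0.0 if zeros == 0 else -math.inf
    return ones * math.log(theta) + zeros * math.log1p(-theta)


def mdl_estimate(models, ones: int, n: int) -> float:
    """Parameter maximizing 2^{-Kw(theta)} * p_theta(x) (assumed MDL rule)."""
    return max(models,
               key=lambda m: log_lik(m[0], ones, n - ones) - m[1] * math.log(2))[0]


def cumulative_loss(theta0: float, models, horizon: int) -> float:
    """Sum over n <= horizon of E(theta0 - estimate)^2 for 0 < theta0 < 1."""
    total = 0.0
    for n in range(1, horizon + 1):
        for ones in range(n + 1):
            prob = math.comb(n, ones) * theta0 ** ones * (1 - theta0) ** (n - ones)
            total += prob * (theta0 - mdl_estimate(models, ones, n)) ** 2
    return total


if __name__ == "__main__":
    models = model_class(6)
    print(cumulative_loss(0.5, models, 40), 0.5 * kw(1))
```

Both truncations can only be trusted when the true parameter has small complexity relative to `max_len` and when the loss contributions beyond `horizon` are negligible, so any comparison with $\frac12 Kw(\vartheta_0)$ obtained this way is indicative at best.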

Finally, we state an implication which almost trivially follows from Theorem 10 but may be very useful for practical purposes, e.g. for hypothesis testing (compare [Ris99]).

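Corollary 14 Let $\Theta$ contain $N$ elements, $Kw(\cdot)$ be any complexity function on $\Theta$, and $\vartheta_0 \in \Theta$. Then we have

$$\sum_{n=1}^{\infty} \mathbf{E}(\vartheta_0 - \vartheta^x)^2 \stackrel{\times}{\le} N + Kw(\vartheta_0).$$

Proof. $\sum_k 2^{-\Delta(k)}\sqrt{\Delta(k)} \le N$ is obvious. □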

6 The Universal Case 

We briefly discuss the important universal setup, where $Kw(\cdot)$ is (up to an additive constant) equal to the prefix Kolmogorov complexity $K$ (that is the length of the shortest self-delimiting program printing $\vartheta$ on some universal Turing machine). Since $\sum_k 2^{-K(k)}\sqrt{K(k)} = \infty$ no matter how late the sum starts (otherwise there would be a shorter code for large $k$), Theorem 10 does not yield a meaningful bound. This means in particular that it does not even imply our previous result, Theorem E. But probably the following strengthening of Theorem 10 holds under the same conditions, which then easily implies Theorem E up to a constant.

Conjecture 15 $\sum_n \mathbf{E}(\vartheta_0 - \vartheta^x)^2 \stackrel{\times}{\le} K(\vartheta_0) + \sum_k 2^{-\Delta(k)}$.

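Then, take an incompressible finite binary fraction $\vartheta_0 \in \mathbb{Q}_{\mathbb{B}^*}$, i.e. $K(\vartheta_0) = l(\vartheta_0) + K(l(\vartheta_0))$. For $k > l(\vartheta_0)$, we can reconstruct $\vartheta_0$ and $k$ from $\vartheta^I_k$ and $l(\vartheta_0)$ by just truncating $\vartheta^I_k$ after $l(\vartheta_0)$ bits. Thus $K(\vartheta^I_k) + K(l(\vartheta_0)) \ge K(\vartheta_0) + K(k \mid \vartheta_0, K(\vartheta_0))$ holds. Using Conjecture 15, we obtain

$$\sum_n \mathbf{E}(\vartheta_0 - \vartheta^x)^2 \stackrel{\times}{\le} K(\vartheta_0) + 2^{K(l(\vartheta_0))} \stackrel{\times}{\le} l(\vartheta_0)\left(\log_2 l(\vartheta_0)\right)^2, \qquad (14)$$

where the last inequality follows from the example coding given in (11).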
So, under Conjecture 15, we obtain a bound which slightly exceeds the complexity $K(\vartheta_0)$ if $\vartheta_0$ has a certain structure. It is not obvious if the same holds for all computable $\vartheta_0$. In order to answer this question positively, one could try to use something like [Gac83, Eq.(2.1)]. This statement implies that as soon as $K(k) \ge K_1$ for all $k \ge k_1$, we have $\sum_{k \ge k_1} 2^{-K(k)} \stackrel{\times}{\le} 2^{-K_1}K_1(\log_2 K_1)^2$. It is possible to prove an analogous result for $\vartheta^I_k$ instead of $k$, however we have not found an appropriate coding that does without knowing $\vartheta_0$. Since the resulting bound is exponential in the code length, we therefore have not gained anything.

Another problem concerns the size of the multiplicative constant that is hidden in the upper bound. Unlike in the case of uniformly distributed weights, it is now of exponential size, i.e. $2^{O(1)}$. This is no artifact of the proof, as the following example shows.

Example 16 Let $U$ be some universal Turing machine. We construct a second universal Turing machine $U'$ from $U$ as follows: Let $N \ge 1$. If the input of $U'$ is $1^N p$, where $1^N$ is the string consisting of $N$ ones and $p$ is some program, then $U$ will be executed on $p$. If the input of $U'$ is $0^N$, then $U'$ outputs $\frac12$. Otherwise, if the input of $U'$ is $x$ with $x \in \mathbb{B}^N \setminus \{0^N, 1^N\}$, then $U'$ outputs $\frac12 + 2^{-x-1}$. For $\vartheta_0 = \frac12$, the conditions of Corollary El are satisfied (where the complexity is relative to $U'$), thus

Can this also happen if the underlying universal Turing machine is not "strange" 
in some sense, like U', but "natural"? Again this is not obvious. One would have 
to define first an appropriate notion of a "natural" universal Turing machine which 
rules out cases like U' . If N is of reasonable size, then one can even argue that U' 
is natural in the sense that its compiler constant relative to U is small. 

There is a relation to the class of all deterministic (generally non-i.i.d.) measures. Then MDL predicts the next symbol just according to the monotone complexity $Km$, see [Hut03c]. According to [Hut03c, Theorem 5], $2^{-Km}$ is very close to the universal semimeasure $M$ [ZL70, Lev73]. Then the total prediction error (which is defined slightly differently in this case) can be shown to be bounded by $2^{O(1)}Km(x_{<\infty})^3$ [Hut04]. The similarity to the (unproven) bound (14) "huge constant × polynomial" for the universal Bernoulli case is evident.

7 Discussion and Conclusions 

We have discovered the fact that the instantaneous and the cumulative loss bounds can be incompatible. On the one hand, the cumulative loss for MDL predictions may be exponential, i.e. $2^{Kw(\vartheta_0)}$. Thus it implies almost sure convergence at a slow speed, even for arbitrary discrete model classes [PH04a]. On the other hand, the instantaneous loss is always of order $\frac1n Kw(\vartheta_0)$, implying fast convergence in probability and a cumulative loss bound of $Kw(\vartheta_0)\ln n$. Similar logarithmic loss bounds can be found in the literature for continuous model classes [Ris96].

A different approach to assess convergence speed is presented in [BC91]. There, an index of resolvability is introduced, which can be interpreted as the difference of the expected MDL code length and the expected code length under the true model. For discrete model classes, they show that the index of resolvability converges to zero as $\frac1n Kw(\vartheta_0)$ [BC91, Equation (6.2)]. Moreover, they give a convergence of the predictive distributions in terms of the Hellinger distance [BC91, Theorem 4]. This implies a cumulative (Hellinger) loss bound of $Kw(\vartheta_0)\ln n$ and therefore fast convergence in probability.

If the prior weights are arranged nicely, we have proven a small finite loss bound $Kw(\vartheta_0)$ for MDL (Theorem 10). If parameters of equal complexity are uniformly distributed or not too strongly distorted (Theorem 11 and Corollaries), then the error is within a small multiplicative constant of the complexity $Kw(\vartheta_0)$. This may be applied e.g. for the case of parameter identification (Corollary 13). A similar result holds if $\Theta$ is finite and contains only few parameters (Corollary 14), which may be e.g. satisfied for hypothesis testing. In these cases and many others, one can interpret the conditions for fast convergence as the presence of prior knowledge. One can show that if a predictor converges to the correct model, then it performs also well under arbitrarily chosen bounded loss-functions [Hut03a, Theorem 4]. From an information theoretic viewpoint one may interpret the conditions for a small bound in Theorem 10 as "good codes".

We have proven our positive results only for Bernoulli classes; of course, it would
be desirable to cover more general i.i.d. classes. At least for finite alphabet, our 
assertions are likely to generalize, as this is the analog to Theorem ^ which also 
holds for arbitrary finite alphabet. Proving this seems even more technical than 
Theorem 10 and therefore not very interesting. (The interval construction has to
be replaced by a sequence of nested sets in this case. Compare also the proof of the
main result in [Ris96].) For small alphabets of size $A$, meaningful bounds can still
be obtained by chaining our bounds $A - 1$ times.

It seems more interesting to ask if our results can be conditionalized with respect 
to inputs. That is, in each time step, we are given an input and have to predict a 
label. This is a standard classification problem, for example a binary classification 
in the Bernoulli case. While it is straightforward to show that Theorem ^ still holds 
in this setup [PH05], it is not clear in which way the present proofs can be adapted.
We leave this interesting question open. 

We conclude with another open question. In abstract terms, we have proven a 
convergence result for the Bernoulli case by mainly exploiting the geometry of the 
space of distributions. This has been quite easy in principle, since for Bernoulli this 
space is just the unit interval (for i.i.d. it is the space of probability vectors). It is not
at all obvious if this approach can be transferred to general (computable) measures. 
A Proof of Theorem 10

The proof of Theorem 10 requires some preparations. We start by showing assertions
on the interval construction from Definition |H1 

Lemma 17 The interval construction has the following properties. 

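(i) $|J_k| = 2^{-k}$,
(ii) $d(\vartheta_0, I_k) \ge 2^{-k-2}$,
(iii) $\max_{\vartheta \in I_k} |\vartheta - \vartheta_0| \le 2^{-k+1}$,
(iv) $d(J_{k+5}, I_k) \ge 15 \cdot 2^{-k-6}$.

By $d(\cdot, \cdot)$ we mean the Euclidean distance: $d(\vartheta, I) = \min\{|\tilde\vartheta - \vartheta| : \tilde\vartheta \in I\}$ and $d(J, I) = \min\{d(\tilde\vartheta, I) : \tilde\vartheta \in J\}$.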

Proof. The first three equations are fairly obvious. The last estimate can be justified as follows. Assume that the $k$th step of the interval construction is a c-step; the same argument applies if it is an l-step or an r-step. Let $c$ be the center of $J_k$ and assume without loss of generality $\vartheta_0 \le c$. Define $\vartheta_I = \max\{\vartheta \in I_k : \vartheta \le c\}$ and $\vartheta_J = \min\{\vartheta \in J_{k+5}\}$ (recall the general assumption $\vartheta \in \Theta$ for all $\vartheta$ that occur, i.e. $\vartheta_I, \vartheta_J \in \Theta$). Then $\vartheta_I = c - 2^{-k-1}$ and $\vartheta_J \ge c - 2^{-k-2} - 2^{-k-6}$, where equality holds if $\vartheta_0 = c - 2^{-k-2}$. Consequently, $\vartheta_J - \vartheta_I \ge 2^{-k-1} - 2^{-k-2} - 2^{-k-6} = 15 \cdot 2^{-k-6}$. This establishes the claim. □

Next we turn to the minimum complexity elements in the intervals. 
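Proposition 18 The following assertions hold for all $k \ge 1$.

(i) $Kw(\vartheta^J_k) \le Kw(\vartheta_0)$,
(ii) $Kw(\vartheta^J_{k+6}) \ge Kw(\vartheta^J_k)$,
(iii) $Kw(\vartheta^I_{k+1}) \ge Kw(\vartheta^J_k)$,
(iv) $\sum_{k=1}^{\infty} \max\{Kw(\vartheta^J_{k+5}) - Kw(\vartheta^I_k), 0\} \le 6Kw(\vartheta_0)$.

Proof. The first three inequalities follow from $\vartheta_0 \in J_k$ and $J_{k+6}, I_{k+1} \subset J_k$. This implies

$$\sum_{j=0}^{m} \max\{Kw(\vartheta^J_{6j+6}) - Kw(\vartheta^I_{6j+1}), 0\} \le \max\{Kw(\vartheta^J_6) - Kw(\vartheta^I_1), 0\} + \sum_{j=1}^{m} \max\{Kw(\vartheta^J_{6j+6}) - Kw(\vartheta^J_{6j}), 0\} \le Kw(\vartheta^J_{6m+6}) \le Kw(\vartheta_0)$$

for all $m \ge 0$. By the same argument, we have

$$\sum_{j=0}^{\infty} \max\{Kw(\vartheta^J_{6j+i+5}) - Kw(\vartheta^I_{6j+i}), 0\} \le Kw(\vartheta_0)$$

for all $1 \le i \le 6$ (use (iii) in the first inequality, (ii) in the second, and (i) in the last). This implies (iv). Clearly, we could everywhere substitute 5 by some constant $k'$ and 6 by $k' + 1$, but we will need the assertion only for the special case. □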

Consider the case that $\vartheta_0$ is located close to the boundary of $[0, 1]$. Then the interval construction involves for a long time only l-steps, if we assume without loss of generality $\vartheta_0 \le \frac12$. We will need to treat this case separately, since the estimates for the general situation work only as soon as at least one c-step has taken place. Precisely, the interval construction consists only of l-steps as long as

$$\vartheta_0 < \tfrac34 2^{-k}, \quad\text{i.e.}\quad k < -\log_2\vartheta_0 + \log_2(\tfrac34).$$

We therefore define

$$k_0 = \max\{0, \lfloor -\log_2\vartheta_0 + \log_2\tfrac34 \rfloor\} \qquad (15)$$

and observe that the $(k_0 + 1)$st step is the first c-step. We are now prepared to give the main proof.

Proof of Theorem 10. Assume $\vartheta_0 \in \Theta \setminus \{0, 1\}$; the case $\vartheta_0 \in \{0, 1\}$ is handled like Case 1a below and will be left to the reader.

Before we start, we will show that the contribution of $\vartheta = 1$ to the total error is bounded by $\frac14$. This is immediate, since 1 cannot become the maximizing element as soon as $x \ne 1^n$. Therefore the contribution is bounded by

$$\sum_{n=1}^{\infty} (1 - \vartheta_0)^2 p(1^n) = (1 - \vartheta_0)^2 \sum_{n=1}^{\infty} \vartheta_0^n = \vartheta_0(1 - \vartheta_0) \le \tfrac14. \qquad (16)$$

The same is true for the contribution of $\vartheta = 0$.

As already mentioned, we first estimate the contributions of $\vartheta \in I_k$ for small $k$ if the true parameter $\vartheta_0$ is located close to the boundary. To this aim, we assume $\vartheta_0 \le \frac12$ without loss of generality. We know that the interval construction involves only l-steps as long as $k \le k_0$, see (15). The very last five of these $k$ still require a particular treatment, so we start with $k \le k_0 - 5$ and $\alpha$ is far from $\vartheta_0$. (If $k_0 - 5 < 1$, then there is nothing to estimate.)

Case 1a: $k \le k_0 - 5$, $j \le k_1$, $\alpha \in I_j = [2^{-j}, 2^{-j+1})$, where $k_1 = k + \lceil\log_2(k_0 - k - 3)\rceil + 2$. The probability of $\alpha$ does not exceed $p(2^{-j})$. The squared error may clearly be upper bounded by $2^{-2k+2} = O(2^{-2k})$. For $n < 2^j$, no such fractions can occur, so we may consider only $n = 2^j + n'$, $n' \ge 0$. Finally, there are at most $\lceil n \cdot 2^{-j} \rceil = O(2^{-j}n)$ fractions $\alpha \in I_j$. This follows from the general fact that if $I \subset (0, 1)$ is any half-open or open interval of length at most $l$, then at most $\lceil nl \rceil$ observed fractions can be located in $I$.
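The counting fact used at the end of this case is easy to confirm by brute force. The sketch below uses exact rational arithmetic and hypothetical helper names; it counts the observable fractions $i/n$ in a half-open interval and compares the count with $\lceil nl \rceil$ for the dyadic intervals $[2^{-j}, 2^{-j+1})$.

```python
import math
from fractions import Fraction


def count_fractions(n: int, lo: Fraction, hi: Fraction) -> int:
    """Number of fractions i/n, 0 <= i <= n, in the half-open interval [lo, hi)."""
    return sum(1 for i in range(n + 1) if lo <= Fraction(i, n) < hi)


def within_bound(j: int, n_max: int) -> bool:
    """Check count <= ceil(n * length) for [2^-j, 2^-(j-1)) and all n <= n_max."""
    lo, hi = Fraction(1, 2 ** j), Fraction(1, 2 ** (j - 1))
    return all(count_fractions(n, lo, hi) <= math.ceil(n * (hi - lo))
               for n in range(1, n_max + 1))


if __name__ == "__main__":
    print(all(within_bound(j, 200) for j in range(1, 7)))
```

The half-open (or open) shape of the interval matters here: a closed interval of length $l$ can contain $\lfloor nl \rfloor + 1$ fractions, which exceeds $\lceil nl \rceil$ whenever $nl$ is an integer.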
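We now derive an estimate for the probability which is

$$p(\alpha|n) \le p(2^{-j}|n) \stackrel{\times}{\le} n^{-1/2}\,2^{j/2}\exp\left[-n \cdot D(2^{-j}\|\vartheta_0)\right]$$

according to Lemma El. Then, Lemma 2 (v) implies

$$\exp\left[-nD(2^{-j}\|\vartheta_0)\right] \le \exp\left[-(2^j + n')D(2^{-j}\|2^{-k_0})\right] \le \exp\left[-n'2^{-j}(k_0 - j - 1)\right].$$

Taking into account the upper bound for the squared error $O(2^{-2k})$ and the maximum number of fractions $O(2^{-j}n)$, the contribution $C(k, j)$ can be upper bounded by

$$C(k, j) \stackrel{\times}{\le} \sum_{n=2^j}^{\infty} p(2^{-j}|n)\,2^{-2k} \cdot 2^{-j}n \stackrel{\times}{\le} 2^{-2k-j/2}\sum_{n'=0}^{\infty} \sqrt{n}\,\exp\left[-n'2^{-j}(k_0 - j - 1)\right].$$

Decompose the right hand side using $\sqrt{n} \le \sqrt{2^j} + \sqrt{n'}$. Then we have

$$\sum_{n'=0}^{\infty} 2^{-2k-j/2}\sqrt{2^j}\,\exp\left[-n'2^{-j}(k_0 - j - 1)\right] \stackrel{\times}{\le} 2^{-2k+j}(k_0 - j - 1)^{-1} \quad\text{and}$$

$$\sum_{n'=0}^{\infty} 2^{-2k-j/2}\sqrt{n'}\,\exp\left[-n'2^{-j}(k_0 - j - 1)\right] \stackrel{\times}{\le} 2^{-2k+j}(k_0 - j - 1)^{-3/2},$$

where the first inequality is straightforward and the second holds by Lemma IH(i). Letting $k' = k_0 - k - 3$, we have $k' \ge 2$ and

$$(k_0 - j - 1)^{-3/2} \le (k_0 - j - 1)^{-1} \le (k_0 - k_1 - 1)^{-1} = (k' - \lceil\log_2 k'\rceil)^{-1}.$$

Thus we may conclude

$$C(k, \le k_1) := \sum_{j=1}^{k_1} C(k, j) \stackrel{\times}{\le} \sum_{j=1}^{k + \lceil\log_2 k'\rceil + 2} \frac{2^{-2k+j}}{k' - \lceil\log_2 k'\rceil} \qquad (17)$$

$$\stackrel{\times}{\le} 2^{-k}\frac{k'}{k' - \lceil\log_2 k'\rceil} = 2^{-k}\left(1 + \frac{\lceil\log_2 k'\rceil}{k' - \lceil\log_2 k'\rceil}\right) \le 3 \cdot 2^{-k}$$

(the last inequality is sharp for $k' = 3$).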
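The constant 3 in the last step can be checked numerically. The sketch below assumes that the quantity being bounded is $k'/(k' - \lceil\log_2 k'\rceil)$ for $k' \ge 2$ (our reading of the display above); it confirms that this ratio never exceeds 3 on a finite range and attains 3 exactly at $k' = 3$. The helper names are hypothetical.

```python
def ceil_log2(m: int) -> int:
    """ceil(log2(m)) for integers m >= 1, computed without floating point."""
    return (m - 1).bit_length()


def ratio(k_prime: int) -> float:
    """k' / (k' - ceil(log2 k')), defined for k' >= 2."""
    return k_prime / (k_prime - ceil_log2(k_prime))


if __name__ == "__main__":
    worst = max(range(2, 10000), key=ratio)
    print(worst, ratio(worst))
```

Since $\lceil\log_2 k'\rceil$ grows only logarithmically, the ratio tends to 1 for large $k'$, so the small values of $k'$ are the only critical ones.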

Case 1b: $k \le k_0 - 5$, $\alpha \le 2^{-k_1}$ (recall $k_1 = k + \lceil\log_2(k_0 - k - 3)\rceil + 2$). This means that we consider $\alpha$ close to $\vartheta_0$. By (J3J) we know that $\vartheta_0$ beats $\vartheta \in I_k$ if

$$n \cdot D_\alpha(\vartheta_0\|\vartheta) \ge \ln 2\,(Kw(\vartheta_0) - Kw(\vartheta))$$

holds. This happens certainly for $n \ge N_1 := \ln 2 \cdot Kw(\vartheta_0) \cdot 2^{k+4}$, since Lemma 19 below asserts $D_\alpha(\vartheta_0\|\vartheta) \ge 2^{-k-4}$. Thus only smaller $n$ can contribute. The total probability of all $\alpha \le 2^{-k_1}$ is clearly bounded by means of

$$\sum_\alpha p(\alpha|n) \le 1.$$

The jump size, i.e. the squared error, is again $O(2^{-2k})$. Hence the total contribution caused in $I_k$ by $\alpha \le 2^{-k_1}$ can be upper bounded by

$$C(k, > k_1) \stackrel{\times}{\le} \sum_{n=1}^{N_1} 2^{-2k} \stackrel{\times}{\le} Kw(\vartheta_0)2^{-k},$$

where $C(k, > k_1)$ is the obvious abbreviation for this contribution. Together with (17) this implies $C(k) \stackrel{\times}{\le} Kw(\vartheta_0)2^{-k}$ and therefore

$$\sum_{k=1}^{k_0-5} C(k) \stackrel{\times}{\le} Kw(\vartheta_0). \qquad (18)$$

This finishes the estimates for $k \le k_0 - 5$. We now will consider the indices

$$k_0 - 4 \le k \le k_0$$

and show that the contributions caused by these $\vartheta \in I_k$ are at most $O(Kw(\vartheta_0))$.
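The "beats" criterion used in Case 1b is just the comparison of the two MDL scores in logarithmic form. The following sketch assumes that the MDL score of a parameter is $2^{-Kw(\vartheta)}\vartheta^{n\alpha}(1-\vartheta)^{n(1-\alpha)}$ and uses $D_\alpha(\vartheta_0\|\vartheta) = D(\alpha\|\vartheta) - D(\alpha\|\vartheta_0)$ with natural logarithms; it lets one check numerically that $n D_\alpha(\vartheta_0\|\vartheta) - \ln 2\,(Kw(\vartheta_0) - Kw(\vartheta))$ equals the difference of the log-scores. Function names are hypothetical.

```python
import math


def kl(a: float, t: float) -> float:
    """Bernoulli Kullback-Leibler divergence D(a||t) in nats, with 0*log(0) = 0."""
    total = 0.0
    if a > 0:
        total += a * math.log(a / t)
    if a < 1:
        total += (1 - a) * math.log((1 - a) / (1 - t))
    return total


def d_alpha(alpha: float, theta0: float, theta: float) -> float:
    """D_alpha(theta0||theta) = D(alpha||theta) - D(alpha||theta0)."""
    return kl(alpha, theta) - kl(alpha, theta0)


def log_score(theta: float, kw: int, ones: int, n: int) -> float:
    """Natural log of 2^{-kw} * theta^ones * (1 - theta)^(n - ones), 0 < theta < 1."""
    return -kw * math.log(2) + ones * math.log(theta) + (n - ones) * math.log(1 - theta)


def margin(theta0, kw0, theta, kw1, ones, n) -> float:
    """n * D_alpha(theta0||theta) - ln2 * (Kw(theta0) - Kw(theta)); >= 0 means theta0 wins."""
    return n * d_alpha(ones / n, theta0, theta) - math.log(2) * (kw0 - kw1)


if __name__ == "__main__":
    # Illustrative values only: theta0 = 1/4 with Kw = 4, theta = 1/2 with Kw = 3.
    for n in (4, 16, 64):
        print(n, margin(0.25, 4, 0.5, 3, n // 4, n))
```

Because the margin grows linearly in $n$ once $D_\alpha(\vartheta_0\|\vartheta)$ is bounded away from zero, a lower bound on $D_\alpha$ directly yields the threshold $N_1$ beyond which $\vartheta$ can no longer be the maximizing element.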

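Case 2a: $k_0 - 4 \le k \le k_0$, $j \le k + 5$, $\alpha \in I_j$. Assume that $\vartheta \in I_k$ starts contributing only for $n > n_0$. This is not relevant here, and we will set $n_0 = 0$ for the moment, but then we can reuse the following computations later. Consequently we have $n = n_0 + n'$, and from Lemma El we obtain

$$p(\alpha|n) \stackrel{\times}{\le} n^{-1/2}\,2^{k_0/2}\exp\left[-(n_0 + n') \cdot D(\alpha\|\vartheta_0)\right]. \qquad (19)$$

Lemma 17 implies $d(\alpha, \vartheta_0) \ge 2^{-j-2}$ and thus

$$D(\alpha\|\vartheta_0) \ge \frac{2^{-2j-4}}{2^{-k_0+1}} = 2^{-2j-5+k_0} \qquad (20)$$

according to Lemma El (iii). Therefore we obtain

$$\exp\left[-(n_0 + n') \cdot D(\alpha\|\vartheta_0)\right] \le \exp\left[-n_0 \cdot D(\alpha\|\vartheta_0)\right]\exp\left[-n'2^{-2j-5+k_0}\right]. \qquad (21)$$

Again the maximum square error is $O(2^{-2k})$, the maximum number of fractions is $O(n2^{-j})$. Therefore

$$C(k, j) \stackrel{\times}{\le} \exp\left[-n_0 D(\alpha\|\vartheta_0)\right]\,2^{-2k-j+k_0/2}\sum_{n'=1}^{\infty}\sqrt{n_0 + n'}\,\exp\left[-n'2^{-2j-5+k_0}\right]. \qquad (22)$$

We have

$$\sum_{n'=1}^{\infty} 2^{-2k-j+k_0/2}\exp\left[-n'2^{-2j-5+k_0}\right] \stackrel{\times}{\le} 2^{-2k+j-k_0/2} \le 2^{-2k+j} \quad\text{and} \qquad (23)$$

$$\sum_{n'=1}^{\infty} 2^{-2k-j+k_0/2}\sqrt{n'}\,\exp\left[-n'2^{-2j-5+k_0}\right] \stackrel{\times}{\le} 2^{-2k+2j-k_0} \le 2^{-2k+2j}, \qquad (24)$$

where the first inequality is straightforward and the second follows from Lemma 0] (i). Observe $\sum_{j=1}^{k+5} 2^j \stackrel{\times}{\le} 2^k$, $\sum_{j=1}^{k+5} 2^{2j} \stackrel{\times}{\le} 2^{2k}$, and $\sqrt{n} \le \sqrt{n_0} + \sqrt{n'}$ in order to obtain

$$C(k, \le k + 5) \stackrel{\times}{\le} \exp\left[-n_0 D(\alpha\|\vartheta_0)\right](1 + 2^{-k}\sqrt{n_0}). \qquad (25)$$

The right hand side depends not only on $k$ and $n_0$, but formally also on $\alpha$ and even on $\vartheta$, since $n_0$ itself depends on $\alpha$ and $\vartheta$. Recall that for this case we have agreed on $n_0 = 0$, thus $C(k, \le k + 5) = O(1)$.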

Case 2b: $k_0 - 4 \le k \le k_0$, $\alpha \in J_{k+5}$. As before, we will argue that then $\vartheta \in I_k$ can be the maximizing element only for small $n$. Namely, $\vartheta_0$ beats $\vartheta$ if $n \cdot D_\alpha(\vartheta_0\|\vartheta) \ge \ln 2\,(Kw(\vartheta_0) - Kw(\vartheta))$ holds. Since $D_\alpha(\vartheta_0\|\vartheta) \ge 2^{-2k-5}$ as stated in Lemma 20 below, this happens certainly for $n \ge N_1 := \ln 2 \cdot Kw(\vartheta_0) \cdot 2^{2k+5}$, thus only smaller $n$ can contribute. Note that in order to apply Lemma 20, we need $k \ge k_0 - 4$. Again the total probability of all $\alpha$ is at most 1 and the jump size is $O(2^{-2k})$, hence

$$C(k, > k + 5) \stackrel{\times}{\le} \sum_{n=1}^{N_1} 2^{-2k} \stackrel{\times}{\le} Kw(\vartheta_0).$$

Together with $C(k, \le k + 5) = O(1)$ this implies $C(k) \stackrel{\times}{\le} Kw(\vartheta_0)$ and thus

$$\sum_{k=k_0-4}^{k_0} C(k) \stackrel{\times}{\le} Kw(\vartheta_0). \qquad (26)$$

This completes the estimate for the initial l-steps. We now proceed with the main part of the proof. At this point, we drop the general assumption $\vartheta_0 \le \frac12$, so that we can exploit the symmetry otherwise if convenient.

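Case 3a: $k \ge k_0 + 1$, $j \le k + 5$, $\alpha \in I_j$. For this case, we may repeat the computations (19)-(25), arriving at

$$C(k, \le k + 5) \stackrel{\times}{\le} \exp\left[-n_0 D(\alpha\|\vartheta_0)\right](1 + 2^{-k}\sqrt{n_0}). \qquad (27)$$

The right hand side of (27) depends on $k$ and $n_0$ and formally also on $\alpha$ and $\vartheta$. We now come to the crucial point of this proof:

For most $k$, $n_0$ is considerably larger than 0.

That is, for most $k$, $\vartheta \in I_k$ starts contributing late, i.e. for large $n$. This will cause the right hand side of (27) to be small.

We know that $\vartheta_0$ beats $\vartheta \in I_k$ for any $\alpha \in [0, 1]$ as long as

$$nD_\alpha(\vartheta\|\vartheta_0) \le \ln 2\,(Kw(\vartheta) - Kw(\vartheta_0)) \qquad (28)$$

holds. We are interested in for which $n$ this must happen regardless of $\alpha$, so assume that $\alpha$ is close enough to $\vartheta$ to make $D_\alpha(\vartheta\|\vartheta_0) > 0$. Since $Kw(\vartheta) \ge Kw(\vartheta^I_k)$, we see that (28) holds if

$$n \le n_0(k, \alpha, \vartheta) := \frac{\ln 2 \cdot \Delta(k)}{D_\alpha(\vartheta\|\vartheta_0)}.$$

We show the following two relations:

$$\exp\left[-n_0(k, \alpha, \vartheta)D(\alpha\|\vartheta_0)\right] \le 2^{-\Delta(k)} \quad\text{and} \qquad (29)$$

$$\exp\left[-n_0(k, \alpha, \vartheta)D(\alpha\|\vartheta_0)\right]2^{-k}\sqrt{n_0(k, \alpha, \vartheta)} \stackrel{\times}{\le} 2^{-\Delta(k)}\sqrt{\Delta(k)}, \qquad (30)$$

regardless of $\alpha$ and $\vartheta$. Since $D(\alpha\|\vartheta_0) \ge D(\alpha\|\vartheta_0) - D(\alpha\|\vartheta) = D_\alpha(\vartheta\|\vartheta_0)$, (29) is immediate. In order to verify (30), we observe that

$$D(\alpha\|\vartheta_0) \ge 2^{-2j-5+k_0} \ge 2^{-2k-15+k_0} \ge 2^{-2k-15}$$

holds as in (20). So for those $\alpha$ and $\vartheta$ having

$$\eta := \frac{2^{-2k-15}}{D_\alpha(\vartheta\|\vartheta_0)} \ge 1 \qquad (31)$$

we obtain

$$\exp\left[-n_0(k, \alpha, \vartheta)D(\alpha\|\vartheta_0)\right]2^{-k}\sqrt{n_0(k, \alpha, \vartheta)} \le 2^{-\Delta(k)\eta}\,2^{-k}\sqrt{\ln 2 \cdot \Delta(k)\,\eta\,2^{2k+15}} \stackrel{\times}{\le} 2^{-\Delta(k)}\sqrt{\Delta(k)}$$

since $\eta \ge 1$. If on the other hand (31) is not valid, then $D_\alpha(\vartheta\|\vartheta_0) > 2^{-2k-15}$ holds, which together with $D(\alpha\|\vartheta_0) \ge D_\alpha(\vartheta\|\vartheta_0)$ again implies (30).

So we conclude that the dependence on $\alpha$ and $\vartheta$ of the right hand side of (27) is indeed only a formal one. So we obtain $C(k, \le k + 5) \stackrel{\times}{\le} 2^{-\Delta(k)}\sqrt{\Delta(k)}$, hence

$$\sum_{k=k_0+1}^{\infty} C(k, \le k + 5) \stackrel{\times}{\le} \sum_{k=1}^{\infty} 2^{-\Delta(k)}\sqrt{\Delta(k)}. \qquad (32)$$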

Case 3b: $k \ge k_0 + 1$, $\alpha \in J_{k+5}$. We know that $\vartheta^J_{k+5}$ beats $\vartheta$ if

$$n \ge \ln 2 \cdot \max\{Kw(\vartheta^J_{k+5}) - Kw(\vartheta), 0\} \cdot 2^{2k+5},$$

since $D_\alpha(\vartheta^J_{k+5}\|\vartheta) \ge 2^{-2k-5}$ according to Lemma 20. Since $Kw(\vartheta) \ge Kw(\vartheta^I_k)$, this happens certainly for $n \ge N_2 := \ln 2 \cdot \max\{Kw(\vartheta^J_{k+5}) - Kw(\vartheta^I_k), 0\} \cdot 2^{2k+5}$. Again the total probability of all $\alpha$ is at most 1 and the jump size is $O(2^{-2k})$. Therefore we have

$$C(k, > k + 5) \stackrel{\times}{\le} \sum_{n=1}^{N_2} 2^{-2k} \stackrel{\times}{\le} \max\{Kw(\vartheta^J_{k+5}) - Kw(\vartheta^I_k), 0\}.$$

Using Proposition 18 (iv), we conclude

$$\sum_{k=k_0+1}^{\infty} C(k, > k + 5) \stackrel{\times}{\le} Kw(\vartheta_0). \qquad (33)$$

Combining all estimates for $C(k)$, namely (18), (26), (32), and (33), the assertion follows. □

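Lemma 19 Let $1 \le k \le k_0 - 5$, $k_1 = k + \lceil\log_2(k_0 - k - 3)\rceil + 2$, $\vartheta \ge 2^{-k}$, and $\alpha \le 2^{-k_1}$. Then $D_\alpha(\vartheta_0\|\vartheta) \ge 2^{-k-4}$ holds.

Proof. By Lemma 2 (iii) and (vii), we have

$$D(\alpha\|\vartheta) \ge D(2^{-k_1}\|2^{-k}) \ge 2^{-k-1}\left(1 - 2^{-\lceil\log_2(k_0 - k - 3)\rceil - 2}\right) \ge 7 \cdot 2^{-k-4} \quad\text{and}$$

$$D(\alpha\|\vartheta_0) \le D(2^{-k_1}\|2^{-k_0-1}) \le 2^{-k_1}(k_0 + 1 - k_1) \le 2^{-k-2}\,\frac{k_0 - k - \lceil\log_2(k_0 - k - 3)\rceil - 1}{k_0 - k - 3} \le 6 \cdot 2^{-k-4}$$

(the last inequality is sharp for $k = k_0 - 5$). This implies $D_\alpha(\vartheta_0\|\vartheta) = D(\alpha\|\vartheta) - D(\alpha\|\vartheta_0) \ge 2^{-k-4}$. □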

Lemma 20 Let $k \ge k_0 - 4$, $\vartheta \in I_k$, and $\alpha, \tilde\vartheta \in J_{k+5}$. Then we have $D_\alpha(\tilde\vartheta\|\vartheta) \ge 2^{-2k-5}$.

Proof. Assume $ < | without loss of generality. Moreover, we will only present 
the case $ < i? < ~, the other cases are similar and simpler. From Lemma El (Hi) 
and (iv) and Lemma IT7I we know that 

„, „„* (a-tf) 2 15 2 2- 21 - 12 , 

B < Q » £ Sp^) £ ~ and 

D{a p } < 3(a-tf)' < 4- 8- 2-^" < 2.128.2- 2 '- 
v 11 ; - 2ot(l-a) ~ 3 -2a ~ # 

Note that in order to apply Lemma 2 (iv) in the second line we need to know that for $k + 5$ a c-step has already taken place, and the last estimate follows from $\vartheta \le 128\alpha$, which is a consequence of $k \ge k_0 - 4$. Now the assertion follows from $D_\alpha(\tilde\vartheta\|\vartheta) = D(\alpha\|\vartheta) - D(\alpha\|\tilde\vartheta) \ge 2^{-2k-6}(15^2 \cdot 2^{-7} - 1)\vartheta^{-1} \ge 2^{-2k-5}$. □